<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Beginning Python for Bioinformatics &#187; Section 5</title>
	<atom:link href="http://python.genedrift.org/category/section-5/feed/" rel="self" type="application/rss+xml" />
	<link>http://python.genedrift.org</link>
	<description>a step-by-step guide to create Python applications in bioinformatics</description>
	<lastBuildDate>Wed, 10 Mar 2010 13:03:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=3.0-alpha</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Obtaining overrepresented motifs in DNA sequences, part 13</title>
		<link>http://python.genedrift.org/2008/08/20/obtaining-overrepresented-motifs-in-dna-sequences-part-13/</link>
		<comments>http://python.genedrift.org/2008/08/20/obtaining-overrepresented-motifs-in-dna-sequences-part-13/#comments</comments>
		<pubDate>Thu, 21 Aug 2008 02:32:09 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Phase 2]]></category>
		<category><![CDATA[Section 3]]></category>
		<category><![CDATA[Section 5]]></category>
		<category><![CDATA[motifs]]></category>
		<category><![CDATA[defaultdict]]></category>
		<category><![CDATA[determination]]></category>
		<category><![CDATA[dna]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/?p=149</guid>
		<description><![CDATA[Now that we have the best quorum determination function and the ideal function to calculate the binomial expansions it is easy to program a script to calculate the p value of motifs in DNA sequences. To the script
below in the code there are a couple of errors that WordPress doesn&#8217;t let me fix. The &#62; [...]]]></description>
			<content:encoded><![CDATA[<p>Now that we have the best quorum determination function and the ideal function to calculate the <a href="http://en.wikipedia.org/wiki/Binomial_theorem" title="Binomial theorem" rel="wikipedia" class="zem_slink">binomial expansions</a> it is easy to program a script to calculate the <em>p</em> value of motifs in DNA sequences. To the script</p>
<p><em>In the code below there are a couple of errors that WordPress doesn&#8217;t let me fix: the <code>&gt;</code> and <code>&lt;</code> characters are replaced by their literal HTML encoding. I am working on it, sorry.</em></p>
<pre name="code" class="python">
#!/usr/bin/env python

import fasta
import sys
from collections import defaultdict

def choose(n, k):
    if 0 &lt;= k &lt;= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        #print ntok // ktok
        return ntok // ktok
    else:
        return 0

def get_quorums(seqs, mlen):
    &quot;&quot;&quot;
    add seq id_no to a set
    use explicit counter to create seq_no
    &quot;&quot;&quot;
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        # + 1 so that the word ending at the last nucleotide is counted too
        for n in range(len(seq) - mlen + 1):
            quorum[seq[n:n + mlen]].add(id_no)
    return quorum

input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())

foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)

N = len(input_seqs) + len(input_seqs2)

for i in foreground:
    term1 = choose(len(background[i]), len(foreground[i]))
    term2 = choose((N - len(background[i])), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 &lt; p &lt;= 0.0001:
        print i, len(foreground[i]), len(background[i]), p
</pre>
<p>We already defined <code>choose</code> in the last post (more information in the link to the Python Cookbook there). Earlier, Mike sent us a series of quorum-determination functions, and one of the best was shown and explained <a href="http://python.genedrift.org/2008/06/03/obtaining-overrepresented-motifs-in-dna-sequences-part-7/">here</a>. We also need our <code>fasta</code> module to read the sequences (and only the sequences) so that they can be passed to the quorum function.</p>
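<p>As a rough illustration of what the quorum function produces, here is a minimal sketch written in Python 3 syntax (the listing above targets Python 2). The toy sequences and the word width of 3 are made up for this example, and the sketch uses <code>len(seq) - mlen + 1</code> so that the final window of each sequence is counted too.</p>

```python
from collections import defaultdict

def get_quorums(seqs, mlen):
    """Map each word of width mlen to the set of sequence numbers containing it."""
    quorum = defaultdict(set)
    # enumerate replaces the explicit counter; numbering starts at 1
    for id_no, seq in enumerate(seqs, 1):
        # + 1 so the window ending at the last character is included
        for n in range(len(seq) - mlen + 1):
            quorum[seq[n:n + mlen]].add(id_no)
    return quorum

# toy data, invented for illustration only
toy_seqs = ["ACGTAC", "TTACGT", "GGGACG"]
quorum = get_quorums(toy_seqs, 3)

# the size of each set is the number of sequences that contain the word
for word in sorted(quorum):
    print(word, len(quorum[word]))
```

<p>Because each word maps to a set, a word that occurs several times in the same sequence is still counted once for that sequence.</p>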
<p>Basically, we use the foreground and background files as input, determine the quorum of the different words (width 10), and then iterate over the results, calculating the <em>p</em> value for each motif found in the foreground set. The three terms of the hypergeometric distribution are calculated separately, and we print only the motifs whose <em>p</em> value is greater than zero and at most 0.0001 (this threshold can be modified).</p>
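<p>For reference, the textbook hypergeometric probability can be sketched as follows. This is not the exact calculation in the script above; it is a Python 3 sketch of the standard formula <code>P(X = k) = C(K, k) * C(N - K, n - k) / C(N, n)</code>. The parameter names <code>N</code>, <code>K</code>, <code>n</code> and <code>k</code>, the helper names and the toy numbers are all assumptions made for this example.</p>

```python
def choose(n, k):
    """Binomial coefficient, the same algorithm as in the post, in Python 3 syntax."""
    if k >= 0 and n >= k:
        ntok = 1
        ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    return 0

def hypergeom_pmf(N, K, n, k):
    """P(X = k): k of the n drawn sequences carry the motif,
    when K of all N sequences carry it."""
    return choose(K, k) * choose(N - K, n - k) / choose(N, n)

def hypergeom_tail(N, K, n, k):
    """P(X >= k): probability of seeing k or more motif-carrying sequences."""
    return sum(hypergeom_pmf(N, K, n, i) for i in range(k, min(K, n) + 1))

# toy numbers: 20 sequences in total, 5 carry the motif,
# 8 are in the foreground and 4 of those carry the motif
p = hypergeom_tail(20, 5, 8, 4)
print(p)
```

<p>Summing the tail is a common choice when testing for over-representation; whether a single term or the tail suits your data better is a decision this sketch does not make for you.</p>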
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2008/08/20/obtaining-overrepresented-motifs-in-dna-sequences-part-13/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Translating DNA into proteins: second approach, now using FASTA files</title>
		<link>http://python.genedrift.org/2007/07/11/translating-dna-into-proteins-second-approach-now-using-fasta-files/</link>
		<comments>http://python.genedrift.org/2007/07/11/translating-dna-into-proteins-second-approach-now-using-fasta-files/#comments</comments>
		<pubDate>Wed, 11 Jul 2007 20:10:09 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 5]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/07/11/translating-dna-into-proteins-second-approach-now-using-fasta-files/</guid>
		<description><![CDATA[We have seen before how to translate DNA sequences into amino acids sequences. We have even created a module that contains the dictionary for the genetic code. Now we are going to combine both (very simple) modules we created in one nice script for day-to-day use.
So, we have the dnatranslate.py and the fasta.py that we [...]]]></description>
			<content:encoded><![CDATA[<p>We have seen before how to translate DNA sequences into amino acids sequences. We have even created a module that contains the dictionary for the genetic code. Now we are going to combine both (very simple) modules we created in one nice script for day-to-day use.</p>
<p>So, we have the <a href="http://python.genedrift.org/code/dnatranslate.py">dnatranslate.py</a> and the <a href="http://python.genedrift.org/code/fasta.py">fasta.py</a> modules that we are going to import into our script. And that&#8217;s basically it: we call functions that were already created and stored in modules that can be reused anytime. In the end, our script that translates DNA sequences to proteins takes little more than a handful of lines.</p>
<pre name="code" class="python">
#!/usr/bin/env python

import dnatranslate
import sys
import fasta

dna = fasta.read_fasta(open(sys.argv[1], &#039;r&#039;).readlines())

for item in dna:
    protein = dnatranslate.translate_dna(item.sequence)
    print item.name
    print protein
</pre>
<p>That&#8217;s it. A good example of reusable code that, once created, fits everywhere and handles most types of data. We read the FASTA file in the first line and iterate over the items created, translating them as we go. As an extra exercise, we can include the output formatting function. First we need to update the <code>fasta.py</code> module (already in the repository) and slightly change the formatting function, which ends up looking like this</p>
<pre name="code" class="python">
def format_output(sequence, length):
    temp = []
    for j in range(0,len(sequence),length):
        temp.append(sequence[j:j+length])
    return &#039;\n&#039;.join(temp)
</pre>
<p>For this case the ideal formatting function goes through the &#8220;longer&#8221; route mentioned before, because the final printing should be done by the main script and not by the imported module. This gives us more control over what we want to do with the resulting string. The <code>format_output</code> function receives two arguments: the first is the actual DNA/protein sequence to be formatted and the second is the number of characters we want on each output line. We had to remove the loop too, so only one sequence can be sent to the function at a time and, as pointed out, the function returns a string with the formatted sequence. In the end the script from the beginning of this post needs only one modification</p>
<pre name="code" class="python">
#!/usr/bin/env python

import dnatranslate
import sys
import fasta

dna = fasta.read_fasta(open(sys.argv[1], &#039;r&#039;).readlines())

for item in dna:
    protein = dnatranslate.translate_dna(item.sequence)
    print item.name
    print fasta.format_output(protein, 60)
</pre>
<p>the last line, which, instead of printing the result of the translation directly, first sends the sequence to the formatting function.</p>
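<p>To see what the formatting function returns, here is a small self-contained sketch in Python 3 syntax (the listings above are Python 2). The short sample string and the width of 10 are made up for illustration.</p>

```python
def format_output(sequence, length):
    """Break sequence into lines of at most length characters."""
    chunks = [sequence[j:j + length] for j in range(0, len(sequence), length)]
    return "\n".join(chunks)

# a short sample of 25 residues, invented for this example
sample = "MLLLARCLLLVLVSSLLVCSGLACG"
formatted = format_output(sample, 10)

# the caller decides what to do with the string; here we simply print it
print(formatted)
```

<p>Since the function returns a string instead of printing, the same result could just as easily be written to a file by the calling script.</p>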
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/07/11/translating-dna-into-proteins-second-approach-now-using-fasta-files/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Formatting output of FASTA files</title>
		<link>http://python.genedrift.org/2007/07/10/formatting-output-of-fasta-files/</link>
		<comments>http://python.genedrift.org/2007/07/10/formatting-output-of-fasta-files/#comments</comments>
		<pubDate>Tue, 10 Jul 2007 16:30:15 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 5]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/07/10/formatting-output-of-fasta-files/</guid>
		<description><![CDATA[The Beginning Perl for Bioinformatics book shows a script to print formatted sequence data, specifying that no more than 80 characters (either nucleotides or amino acids) should be printed across a page. Here, we will see a similar script in Python and will include it in our fasta.py module in order to use it a [...]]]></description>
			<content:encoded><![CDATA[<p>The Beginning Perl for Bioinformatics book shows a script to print formatted sequence data, specifying that no more than 80 characters (either nucleotides or amino acids) should be printed across a page. Here, we will see a similar script in Python and will include it in our <code>fasta.py</code> module in order to use it a default output, taking advantage of its usability.</p>
<p>Basically, our FASTA reader strips all the carriage returns/new lines (if present) from the input file, so the sequence itself is stored continuously in a string. Of course we could do it differently and store the sequence as it is in the file, but then sequences might differ in the number of characters per line and in other features. Our goal then is to <i>break</i> the sequence into chunks, making it look like a justified paragraph. We use the code below to achieve this; the explanation follows it.</p>
<pre name="code" class="python">
#!/usr/bin/env python

import fasta
import sys

sequences = fasta.read_fasta(open(sys.argv[1], &#039;r&#039;).readlines())

for i in sequences:
    print i.name
    #start each sequence with an empty list of chunks
    temp = []
    for j in range(0,len(i.sequence),80):
        temp.append(i.sequence[j:j+80])
    print &#039;\n&#039;.join(temp)
</pre>
<p>The script, as always, is very simple and does the job. A new feature of the <code>range</code> function is introduced: the <i>step</i>. In previous posts we have seen that <code>range</code> generates a sequence of integers based either on one value </p>
<pre name="code" class="python">
&gt;&gt;&gt; range(5)
[0, 1, 2, 3, 4]
</pre>
<p>or two values</p>
<pre name="code" class="python">
&gt;&gt;&gt; range(10,15)
[10, 11, 12, 13, 14]
</pre>
<p>The <i>step</i> parameter makes the generated sequence jump by a certain number of values. </p>
<pre name="code" class="python">
&gt;&gt;&gt; range(10, 20, 2)
[10, 12, 14, 16, 18]
</pre>
<p>This is similar to the <code>for</code> loop in C/C++, where the loop is defined by three parameters in a construction like this</p>
<pre name="code" class="python">
for (i = 0; i &lt; 10; i++)
</pre>
<p>where the first parameter sets the initial value of <code>i</code>, the second sets the condition that keeps the loop running (here <code>i</code> must stay below 10) and the last one sets the <i>step</i> by which <code>i</code> is incremented. </p>
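<p>The correspondence with the C-style loop can be checked directly. The sketch below is in Python 3 syntax, where <code>range</code> returns a lazy object, so <code>list()</code> is used to display its values. The helper name <code>c_style_range</code> is hypothetical and exists only for this comparison.</p>

```python
def c_style_range(start, stop, step):
    """Mimic a C for loop: begin at start, run while i is below stop,
    and add step to i after every pass."""
    values = []
    i = start
    while stop > i:
        values.append(i)
        i += step
    return values

# both lines print the same list of integers
print(c_style_range(10, 20, 2))
print(list(range(10, 20, 2)))
```

<p>As in the C loop, the stop value itself is never produced.</p>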
<p>In our small script above, we read the file (this time putting everything in one line), get the sequences and then iterate over the instances of our class, printing first the name of each sequence. Then we iterate over the whole length of each sequence, this time generating a range that starts at zero, goes up to the final nucleotide/amino acid and has a <i>step</i> equal to the number of characters we want to print on one line.</p>
<p>To print the sequence to the screen we had two options: directly print using </p>
<pre name="code" class="python">
print i.sequence[j:j+80]
</pre>
<p>or use the longer route shown in our script. Running this script will print the sequences from the original file with 80 characters per line. Next, following the book, we will write a script to translate DNA to protein using our FASTA reading function and modify our FASTA module to include the formatted output function.</p>
<p><em>Please notice that the code uploaded to the repository uses the shorter route, just to show the differences.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/07/10/formatting-output-of-fasta-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reading FASTA files: conclusion</title>
		<link>http://python.genedrift.org/2007/07/04/reading-fasta-files-conclusion/</link>
		<comments>http://python.genedrift.org/2007/07/04/reading-fasta-files-conclusion/#comments</comments>
		<pubDate>Wed, 04 Jul 2007 20:36:15 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 5]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/07/04/reading-fasta-files-conclusion/</guid>
		<description><![CDATA[In the previous entry we have seen how to read a FASTA file and display a simple output. The ability of reading such files in Bioinformatics is extremely relevant, so from now on most of our scripts that deal with sequences will use this feature. In the repository there is a link to the fasta.py [...]]]></description>
			<content:encoded><![CDATA[<p>In the previous entry we saw how to read a FASTA file and display a simple output. The ability to read such files is extremely relevant in bioinformatics, so from now on most of our scripts that deal with sequences will use this feature. In the repository there is a link to <code>fasta.py</code>, which contains the code below </p>
<pre name="code" class="python">
class Fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(&quot;&gt;&quot;):
            if index &gt;= 1:
                items.append(aninstance)
            index += 1
            #rstrip() removes the line ending without cutting a
            #character from a final line that has no newline
            name = line.rstrip()
            seq = &#039;&#039;
            aninstance = Fasta(name, seq)
        else:
            seq += line.rstrip()
            aninstance = Fasta(name, seq)

    items.append(aninstance)
    return items
</pre>
<p>We are going to reuse this code several times, and now we only need to import it using</p>
<pre name="code" class="python">
import fasta
</pre>
<p>and we are done. This will make everything easier. Please be aware that the following posts use this simple module, and download it from the repository if needed.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/07/04/reading-fasta-files-conclusion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reading FASTA files</title>
		<link>http://python.genedrift.org/2007/07/04/reading-fasta-files-2/</link>
		<comments>http://python.genedrift.org/2007/07/04/reading-fasta-files-2/#comments</comments>
		<pubDate>Wed, 04 Jul 2007 17:33:41 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 5]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/07/04/reading-fasta-files-2/</guid>
		<description><![CDATA[We now have an idea on how to create a class in Python. For our FASTA reader we will use a very similar approach as the employee class created before. Based on the simplicity of the FASTA format it is easy to see what attributes we need in our class: sequence title (header) and sequence. [...]]]></description>
			<content:encoded><![CDATA[<p>We now have an idea of how to create a class in Python. For our FASTA reader we will use an approach very similar to the employee class created before. Based on the simplicity of the FASTA format it is easy to see what attributes we need in our class: sequence title (header) and sequence. That&#8217;s it. Speeding up, our class will look like this</p>
<pre name="code" class="python">
class Fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence
</pre>
<p>That&#8217;s all we need for now. Of course we could include sequence length, sequence type, or any other relevant information as a class attribute. But we will stick with those two for the time being. Now that we have the class declared, we need to read the file and create instances of the <code>Fasta</code> class to store all our sequences.</p>
<p>We already know how to open and read a file and create functions. We put everything together in a simple script:</p>
<pre name="code" class="python">
import sys
#class declaration with both attributes we need
class Fasta:
    def __init__(self, name, sequence):
        #this will store the sequence name
        self.name = name
        #this  will store the sequence itself
        self.sequence = sequence

#this function will receive the list with the file
#contents, create instances of the Fasta class as
#it scans the list, putting the sequence name on the
#first attribute and the sequence itself on the second
#attribute
def read_fasta(file):
    #we declare an empty list that will store all
    #Fasta class instances generated
    items = []
    index = 0
    for line in file:
        #we check to see if the line starts with a &gt; sign
        if line.startswith(&quot;&gt;&quot;):
            #if so and our counter is at least 1 we add the
            #previously created class instance to our list;
            #a counter of 1 or more means we are reading
            #sequence 2 or above
            if index &gt;= 1:
                items.append(aninstance)
            index += 1
            #we store the header line, without its line ending,
            #in a string
            name = line.rstrip()
            #and initialize the string to store the sequence
            seq = &#039;&#039;
            #this creates a class instance and we add the attributes
            #which are the strings name and seq
            aninstance = Fasta(name, seq)
        else:
            #the line does not start with &gt; so it has to be
            #a sequence line, so we extend the string and
            #recreate the instance with the longer sequence
            seq += line.rstrip()
            aninstance = Fasta(name, seq)

    #inside the loop an instance is only appended when the
    #next header is found, so the last sequence still has
    #to be added after the loop ends
    items.append(aninstance)
    #a list with all read sequences is returned
    return items

fastafile = open(sys.argv[1], &#039;r&#039;).readlines()
mysequences = read_fasta(fastafile)

print mysequences

for i in mysequences:
    print i.name
</pre>
<p>At first, it looks scary. But it is not. There are many ways to create this loop and to read the sequences, and many ways to make this loop shorter. We will get there, eventually, but for starters this is OK. Basically, in the above script we have a <code>Fasta</code> class, a <code>read_fasta</code> function and a couple of lines to read the file and print the results. The <code>read_fasta</code> function checks all items of a list and looks at the first character of each item: a <code>&gt;</code> sign indicates that the item should be stored in the temporary name string, while any other character sends the item to the sequence string. Instances are created on the fly and their attributes are assigned as the file content is scanned. </p>
<p>In the end, just to be sure of what we accomplished with the function, we print the list, then loop through it and print the name of each sequence read. Notice that the loop variable <code>i</code> refers to each class instance in turn, so the attribute is accessed as <code>i.name</code>.</p>
<p>The biggest advantage of this way of reading FASTA files is that the class and the function are reusable. Basically we can create a file (for instance called readfasta.py), import it into another script and we have a FASTA reader ready to rock. We would need to tweak it a little bit in order to catch exceptions, but with consistent FASTA files this works fine.</p>
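<p>One possible way to make the reader a little more tolerant is sketched below in Python 3 syntax. It is not the code in the repository: skipping blank lines, raising a <code>ValueError</code> when sequence data appears before any header, and collecting the sequence lines in a list called <code>chunks</code> are all assumptions about what such a tweak could look like.</p>

```python
class Fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(lines):
    """Build a list of Fasta instances from an iterable of text lines."""
    items = []
    name = None
    chunks = []
    for line in lines:
        line = line.rstrip()          # drop newline and carriage return
        if not line:
            continue                  # ignore blank lines
        if line.startswith(">"):
            if name is not None:      # a new header closes the previous record
                items.append(Fasta(name, "".join(chunks)))
            name = line
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before any header line")
        else:
            chunks.append(line)
    if name is not None:              # the last record has no header after it
        items.append(Fasta(name, "".join(chunks)))
    return items

# in-memory lines stand in for open(filename).readlines()
example = [">seq1 first\n", "ACGT\n", "AC\n", "\n", ">seq2 second\n", "GGTT"]
records = read_fasta(example)
for record in records:
    print(record.name, len(record.sequence))
```

<p>An empty input simply returns an empty list here, and a final line without a newline keeps all of its characters.</p>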
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/07/04/reading-fasta-files-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Reading FASTA files: introduction</title>
		<link>http://python.genedrift.org/2007/07/03/reading-fasta-files/</link>
		<comments>http://python.genedrift.org/2007/07/03/reading-fasta-files/#comments</comments>
		<pubDate>Tue, 03 Jul 2007 21:39:37 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 5]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/07/03/reading-fasta-files/</guid>
		<description><![CDATA[Again after a long period we are back. We already have most of the knowledge to create very useful scripts and small programs in Python. And in this post we will create a routine that will make the biologist&#8217;s life even easier (when programming Python).
A great part of bioinformatics is to store data, how to [...]]]></description>
			<content:encoded><![CDATA[<p>Again after a long period we are back. We already have most of the knowledge to create very useful scripts and small programs in Python. And in this post we will create a routine that will make the biologist&#8217;s life even easier (when programming Python).</p>
<p>A great part of bioinformatics is about storing data: how to store it and which format to use. Anyone working in a wet lab or in a pure bioinformatics lab has had problems with file formats at one time or another. In fact, you don&#8217;t have to be in a bioinformatics environment to have such problems, but in biology people tend to create a file format for every program they write. We can call it a lack of standards, sometimes. But one very well established format is FASTA (pronounced FAST-Aye, according to the EMBL-EBI page for the software with the identical name). This format is really simple and easy to manipulate with most computer languages, and being text-based gives the files the extra advantage of portability. Usually a FASTA file has a structure like this</p>
<p>&gt;<i>title/name/extra information about the sequence</i><br />
<i>sequence in one or many lines</i></p>
<p>There is no limit on the number of sequences that can be stored in a file, nor on the size of each sequence. Usually sequences that are longer than 70-80 nucleotides/amino acids are displayed on multiple lines.</p>
<p>&gt;Q15465|SHH_HUMAN Sonic hedgehog protein &#8211; Homo sapiens (Human).<br />
MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG<br />
RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV<br />
KLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH<br />
IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSDFLT<br />
FLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG<br />
PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL<br />
ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA<br />
ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS</p>
<p>So, let&#8217;s write a function and a small script to read a typical FASTA file and display the output. Later we will see how to manipulate the file and in the end we will have created a list of scripts that will be useful for the everyday laboratory life.</p>
<p>But before we get to our final goal, we need to learn some new features of Python. In our case, <b>Classes</b>. Classes in Python are very similar to classes in C++, and they are the building blocks of Object Oriented Programming (OOP). This guide will only scratch the surface of OOP, as it is a very complex subject, requiring a guide of its own. There is plenty of good introductory material online and any Google search will return dozens of links.</p>
<p>Let&#8217;s focus on some basic concepts, which are exactly what we will need here. A class is basically a template for objects with associated properties (attributes) and methods. A traditional introductory example of classes is the employee list. In a company all employees are registered and their basic information is stored in the human resources file system. Let&#8217;s call the main class <code>Employee</code> and give it three attributes: <code>name</code>, <code>room</code> and <code>colour</code> (the employee&#8217;s name, room number and favourite colour). This defines a class. Let&#8217;s see how to do that in Python:</p>
<pre name="code" class="python">
class Employee:
    def __init__(self, name, room, colour):
        self.name = name
        self.room = room
        self.colour = colour
</pre>
<p>Let&#8217;s dissect this piece of code. First we declare a class and give it a name, <code>Employee</code>. The next thing we need to do is to define the initialization (constructor) method of the class and give it the attributes we need: that&#8217;s the <code>__init__</code> method. Every time we create an instance of the class, the instance will be initialized with the values we pass to this method. According to Dive into Python &#8220;The first argument of every class method, including __init__, is always a reference to the current instance of the class.&#8221;. That&#8217;s the <code>self</code> in the method definition, which is followed by what we want to store in the object: name, room and favourite colour. The lines in the method are the ones assigning the received values to the different attributes of the class. Again from Dive into Python we have &#8220;To reference this attribute from code outside the class, you qualify it with the instance name, instance.data, in the same way that you qualify a function with its module name. To reference a data attribute from within the class, you use self as the qualifier.&#8221;. In other words, the attributes are accessed inside the class through <code>self</code>, while from the outside (another part of the script) they are accessed through the name given to the instance of the class. It is a little bit confusing, so let&#8217;s see an example below.</p>
<pre name="code" class="python">
class Employee:
    def __init__(self, name, room, colour):
        self.name = name
        self.room = room
        self.colour = colour

employeename = &quot;Paulo&quot;
roomnumber = &quot;21&quot;
colour = &quot;blue&quot;

newemployee = Employee(employeename, roomnumber, colour)
print newemployee
</pre>
<p>If we run this simple code we will get something like this</p>
<pre name="code" class="python">
&lt;__main__.Employee instance at 0x7379cc&gt;
</pre>
<p>And that&#8217;s exactly what we need here: that&#8217;s a class instance that was created by </p>
<pre name="code" class="python">
newemployee = Employee(employeename, roomnumber, colour)
</pre>
<p>Printing this information is not very useful. We don&#8217;t want to know the memory address of the object, we want to get its attributes, so we need to replace <code>print newemployee</code> with these lines:</p>
<pre name="code" class="python">
print newemployee.name
print newemployee.room
print newemployee.colour
</pre>
<p>There they are: run it and see what is printed to the screen. Of course, as it stands this script does not accomplish anything useful, but imagine that you need to read a file with 1000 new employees every month. That&#8217;s what we need to do with FASTA files every day, and we will see how to do that in the next post.</p>
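<p>As an aside, a class can also describe itself when printed. The sketch below, in Python 3 syntax, adds a <code>__repr__</code> method to the class; that method and its output format are an addition made for illustration and are not used elsewhere in this guide.</p>

```python
class Employee:
    def __init__(self, name, room, colour):
        self.name = name
        self.room = room
        self.colour = colour

    def __repr__(self):
        # used by print() and by the interactive prompt to describe the instance
        return "Employee(name=%r, room=%r, colour=%r)" % (
            self.name, self.room, self.colour)

newemployee = Employee("Paulo", "21", "blue")

# printing the instance now shows its attributes instead of a memory address
print(newemployee)
# individual attributes are still reached through the instance name
print(newemployee.name)
```

<p>Inside <code>__repr__</code> the attributes are reached through <code>self</code>, while the last line reaches the same attribute through the instance name.</p>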
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/07/03/reading-fasta-files/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Translating DNA into proteins</title>
		<link>http://python.genedrift.org/2007/05/24/translating-dna-into-proteins/</link>
		<comments>http://python.genedrift.org/2007/05/24/translating-dna-into-proteins/#comments</comments>
		<pubDate>Thu, 24 May 2007 21:19:10 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 5]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/05/24/translating-dna-into-proteins/</guid>
		<description><![CDATA[After a long time away, I am back. Things have been hectic in the lab and there is a shortage of free time to do everything else. Let&#8217;s get back to Python.
Last time we have seen a dictionary structure that contained our genetic code. We are going to see a script that will translate DNA [...]]]></description>
			<content:encoded><![CDATA[<p>After a long time away, I am back. Things have been hectic in the lab and there is a shortage of free time to do everything else. Let&#8217;s get back to Python.</p>
<p>Last time we saw a dictionary structure that contained our genetic code. Now we are going to see a script that translates DNA into proteins, and we are also going to see how to create a module in Python that can be imported into any script. This module will contain the function that translates the DNA and, for the time being, only that.</p>
<p>Let&#8217;s see what our module looks like:</p>
<pre name="code" class="python">
def translate_dna(sequence):

    gencode = {
    &#039;ATA&#039;:&#039;I&#039;, &#039;ATC&#039;:&#039;I&#039;, &#039;ATT&#039;:&#039;I&#039;, &#039;ATG&#039;:&#039;M&#039;,
    &#039;ACA&#039;:&#039;T&#039;, &#039;ACC&#039;:&#039;T&#039;, &#039;ACG&#039;:&#039;T&#039;, &#039;ACT&#039;:&#039;T&#039;,
    &#039;AAC&#039;:&#039;N&#039;, &#039;AAT&#039;:&#039;N&#039;, &#039;AAA&#039;:&#039;K&#039;, &#039;AAG&#039;:&#039;K&#039;,
    &#039;AGC&#039;:&#039;S&#039;, &#039;AGT&#039;:&#039;S&#039;, &#039;AGA&#039;:&#039;R&#039;, &#039;AGG&#039;:&#039;R&#039;,
    &#039;CTA&#039;:&#039;L&#039;, &#039;CTC&#039;:&#039;L&#039;, &#039;CTG&#039;:&#039;L&#039;, &#039;CTT&#039;:&#039;L&#039;,
    &#039;CCA&#039;:&#039;P&#039;, &#039;CCC&#039;:&#039;P&#039;, &#039;CCG&#039;:&#039;P&#039;, &#039;CCT&#039;:&#039;P&#039;,
    &#039;CAC&#039;:&#039;H&#039;, &#039;CAT&#039;:&#039;H&#039;, &#039;CAA&#039;:&#039;Q&#039;, &#039;CAG&#039;:&#039;Q&#039;,
    &#039;CGA&#039;:&#039;R&#039;, &#039;CGC&#039;:&#039;R&#039;, &#039;CGG&#039;:&#039;R&#039;, &#039;CGT&#039;:&#039;R&#039;,
    &#039;GTA&#039;:&#039;V&#039;, &#039;GTC&#039;:&#039;V&#039;, &#039;GTG&#039;:&#039;V&#039;, &#039;GTT&#039;:&#039;V&#039;,
    &#039;GCA&#039;:&#039;A&#039;, &#039;GCC&#039;:&#039;A&#039;, &#039;GCG&#039;:&#039;A&#039;, &#039;GCT&#039;:&#039;A&#039;,
    &#039;GAC&#039;:&#039;D&#039;, &#039;GAT&#039;:&#039;D&#039;, &#039;GAA&#039;:&#039;E&#039;, &#039;GAG&#039;:&#039;E&#039;,
    &#039;GGA&#039;:&#039;G&#039;, &#039;GGC&#039;:&#039;G&#039;, &#039;GGG&#039;:&#039;G&#039;, &#039;GGT&#039;:&#039;G&#039;,
    &#039;TCA&#039;:&#039;S&#039;, &#039;TCC&#039;:&#039;S&#039;, &#039;TCG&#039;:&#039;S&#039;, &#039;TCT&#039;:&#039;S&#039;,
    &#039;TTC&#039;:&#039;F&#039;, &#039;TTT&#039;:&#039;F&#039;, &#039;TTA&#039;:&#039;L&#039;, &#039;TTG&#039;:&#039;L&#039;,
    &#039;TAC&#039;:&#039;Y&#039;, &#039;TAT&#039;:&#039;Y&#039;, &#039;TAA&#039;:&#039;_&#039;, &#039;TAG&#039;:&#039;_&#039;,
    &#039;TGC&#039;:&#039;C&#039;, &#039;TGT&#039;:&#039;C&#039;, &#039;TGA&#039;:&#039;_&#039;, &#039;TGG&#039;:&#039;W&#039;,
    }

    print sequence
    proteinseq = &#039;&#039;
    # walk the sequence one codon (three nucleotides) at a time
    for n in range(0,len(sequence),3):
        # skip incomplete or unknown codons instead of raising a KeyError
        if sequence[n:n+3] in gencode:
            proteinseq += gencode[sequence[n:n+3]]

    return proteinseq
</pre>
<p>The module has a single function, <code>translate_dna</code>, that receives a DNA sequence and returns a protein sequence. It is a &#8220;long&#8221; function only because the genetic code, in Python&#8217;s dictionary format, sits in the middle of it. The translation loop itself is very simple: it reads the DNA sequence three nucleotides at a time. The line</p>
<pre name="code" class="python">
for n in range(0,len(sequence),3):
</pre>
<p>means that <code>n</code> goes from 0 up to (but not including) the length of the DNA sequence in steps of 3: 0, 3, 6 and so on. This follows the structure of translation, which is based on codons of three nucleotides. Sometimes the DNA sequence entered does not have a length that is a multiple of three, and that&#8217;s the reason we check that the codon is in the dictionary before accessing it</p>
<pre name="code" class="python">
if sequence[n:n+3] in gencode:
</pre>
<p>This guards against a trailing fragment shorter than three nucleotides, and more generally against any triplet that is not a key in the dictionary (one containing an <code>N</code>, for instance). If the key exists, its value is looked up and added to our protein string; otherwise the fragment is silently skipped. The code that will use this function is this:</p>
<pre name="code" class="python">
#!/usr/bin/env python

import dnatranslate

dnafile = open(&quot;AY162388.seq&quot;, &#039;r&#039;).readlines()

sequence = &#039;&#039;
for line in dnafile:
    sequence += line.strip()

protein = dnatranslate.translate_dna(sequence)

print sequence, len(sequence)
print
print protein, len(protein)
</pre>
<p>No secrets or new things here. Just notice that we import a module which is not part of the standard Python modules, but was created by us. In this case the name used in the <code>import</code> statement is the name of the <code>.py</code> file that contains the function(s) we are going to use, without the extension (here, <code>dnatranslate.py</code>). This file also needs to be located in the same directory as the script, unless it is installed in one of the directories where Python looks for modules. Notice that to use the function we need to call</p>
<pre name="code" class="python">
protein = dnatranslate.translate_dna(sequence)
</pre>
<p>as we would do with the <code>sys</code> or the <code>re</code> modules. We can now create modules containing different functions that can be reused at any time without extra coding. For instance, someone could create a module that reads a FASTA file and returns the sequences and sequence names as strings or in a list, and send it to anyone interested in the same functionality. All we would need to do is have this file installed in Python or placed in the same directory as our script, and we could take advantage of all the functionality contained in the module. That easy.</p>
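<p>As a rough sketch of that FASTA idea, and not a finished parser, a module of this kind could look like the code below. The module name <code>fastareader.py</code> and the function name <code>read_fasta</code> are made up for this example, and the function assumes the simplest form of the format: a header line starting with <code>&gt;</code>, followed by one or more lines of sequence.</p>
<pre name="code" class="python">
def read_fasta(lines):
    """Return a list of (name, sequence) tuples from FASTA-formatted lines.

    lines can be an open file or any other iterable of strings.
    """
    records = []
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # a header starts a new record; store the previous one first
            if name is not None:
                records.append((name, ''.join(chunks)))
            name = line[1:]
            chunks = []
        elif name is not None:
            # sequence line belonging to the current record
            chunks.append(line)
    # do not forget the last record
    if name is not None:
        records.append((name, ''.join(chunks)))
    return records


# a small in-memory example, so no file is needed to try it
example = ['>seq1 test sequence', 'ATGGCC', 'ATTTAA', '', '>seq2', 'ATGTGA']
for name, sequence in read_fasta(example):
    print(name + ' ' + sequence)
</pre>
<p>Saved as <code>fastareader.py</code> next to a script, it could be imported and called as <code>fastareader.read_fasta(open(&quot;some_file.fasta&quot;))</code>, in the same way we called <code>dnatranslate.translate_dna</code>; the file name here is only a placeholder.</p>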
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/05/24/translating-dna-into-proteins/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Genetic code: part I</title>
		<link>http://python.genedrift.org/2007/04/19/genetic-code-part-i/</link>
		<comments>http://python.genedrift.org/2007/04/19/genetic-code-part-i/#comments</comments>
		<pubDate>Thu, 19 Apr 2007 21:53:29 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 5]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/04/19/genetic-code-part-i/</guid>
		<description><![CDATA[The Python dictionary data type is like a hash in Perl. It is a very similar structure, where each element in the variable is composed of two values, more specifically a key-value pair. This is the ideal data type to store the genetic code. As you might know, the genetic code governs the translation of DNA into [...]]]></description>
			<content:encoded><![CDATA[<p>The Python dictionary data type is like a hash in Perl. It is a very similar structure, where each element in the variable is composed of two values, more specifically a key-value pair. This is the ideal data type to store the genetic code. As you might know, the genetic code governs the translation of DNA into proteins, where each codon (3 bases or nucleotides in the DNA sequence) corresponds to an amino acid in the protein sequence. So every sequence of 3 nucleotides (the key) will represent an amino acid (the value). Two important things: dictionaries do not accept duplicate keys, and every time a new value is assigned to an existing key the old value is overwritten. To create a new dictionary use curly brackets</p>
<pre name="code" class="python">
first_dictionary = {}
</pre>
<p>Inside the curly braces we write each key, followed by a colon (:) and its value; multiple pairs are separated by commas. Since our keys and values are strings, both have to be between single or double quotes. Let&#8217;s see how we will represent the genetic code in a Python dictionary, assigning values to keys</p>
<pre name="code" class="python">
gencode = {
    &#039;ATA&#039;:&#039;I&#039;,    # Isoleucine
    &#039;ATC&#039;:&#039;I&#039;,    # Isoleucine
    &#039;ATT&#039;:&#039;I&#039;,    # Isoleucine
    &#039;ATG&#039;:&#039;M&#039;,    # Methionine
    &#039;ACA&#039;:&#039;T&#039;,    # Threonine
    &#039;ACC&#039;:&#039;T&#039;,    # Threonine
    &#039;ACG&#039;:&#039;T&#039;,    # Threonine
    &#039;ACT&#039;:&#039;T&#039;,    # Threonine
    &#039;AAC&#039;:&#039;N&#039;,    # Asparagine
    &#039;AAT&#039;:&#039;N&#039;,    # Asparagine
    &#039;AAA&#039;:&#039;K&#039;,    # Lysine
    &#039;AAG&#039;:&#039;K&#039;,    # Lysine
    &#039;AGC&#039;:&#039;S&#039;,    # Serine
    &#039;AGT&#039;:&#039;S&#039;,    # Serine
    &#039;AGA&#039;:&#039;R&#039;,    # Arginine
    &#039;AGG&#039;:&#039;R&#039;,    # Arginine
    &#039;CTA&#039;:&#039;L&#039;,    # Leucine
    &#039;CTC&#039;:&#039;L&#039;,    # Leucine
    &#039;CTG&#039;:&#039;L&#039;,    # Leucine
    &#039;CTT&#039;:&#039;L&#039;,    # Leucine
    &#039;CCA&#039;:&#039;P&#039;,    # Proline
    &#039;CCC&#039;:&#039;P&#039;,    # Proline
    &#039;CCG&#039;:&#039;P&#039;,    # Proline
    &#039;CCT&#039;:&#039;P&#039;,    # Proline
    &#039;CAC&#039;:&#039;H&#039;,    # Histidine
    &#039;CAT&#039;:&#039;H&#039;,    # Histidine
    &#039;CAA&#039;:&#039;Q&#039;,    # Glutamine
    &#039;CAG&#039;:&#039;Q&#039;,    # Glutamine
    &#039;CGA&#039;:&#039;R&#039;,    # Arginine
    &#039;CGC&#039;:&#039;R&#039;,    # Arginine
    &#039;CGG&#039;:&#039;R&#039;,    # Arginine
    &#039;CGT&#039;:&#039;R&#039;,    # Arginine
    &#039;GTA&#039;:&#039;V&#039;,    # Valine
    &#039;GTC&#039;:&#039;V&#039;,    # Valine
    &#039;GTG&#039;:&#039;V&#039;,    # Valine
    &#039;GTT&#039;:&#039;V&#039;,    # Valine
    &#039;GCA&#039;:&#039;A&#039;,    # Alanine
    &#039;GCC&#039;:&#039;A&#039;,    # Alanine
    &#039;GCG&#039;:&#039;A&#039;,    # Alanine
    &#039;GCT&#039;:&#039;A&#039;,    # Alanine
    &#039;GAC&#039;:&#039;D&#039;,    # Aspartic Acid
    &#039;GAT&#039;:&#039;D&#039;,    # Aspartic Acid
    &#039;GAA&#039;:&#039;E&#039;,    # Glutamic Acid
    &#039;GAG&#039;:&#039;E&#039;,    # Glutamic Acid
    &#039;GGA&#039;:&#039;G&#039;,    # Glycine
    &#039;GGC&#039;:&#039;G&#039;,    # Glycine
    &#039;GGG&#039;:&#039;G&#039;,    # Glycine
    &#039;GGT&#039;:&#039;G&#039;,    # Glycine
    &#039;TCA&#039;:&#039;S&#039;,    # Serine
    &#039;TCC&#039;:&#039;S&#039;,    # Serine
    &#039;TCG&#039;:&#039;S&#039;,    # Serine
    &#039;TCT&#039;:&#039;S&#039;,    # Serine
    &#039;TTC&#039;:&#039;F&#039;,    # Phenylalanine
    &#039;TTT&#039;:&#039;F&#039;,    # Phenylalanine
    &#039;TTA&#039;:&#039;L&#039;,    # Leucine
    &#039;TTG&#039;:&#039;L&#039;,    # Leucine
    &#039;TAC&#039;:&#039;Y&#039;,    # Tyrosine
    &#039;TAT&#039;:&#039;Y&#039;,    # Tyrosine
    &#039;TAA&#039;:&#039;_&#039;,    # Stop
    &#039;TAG&#039;:&#039;_&#039;,    # Stop
    &#039;TGC&#039;:&#039;C&#039;,    # Cysteine
    &#039;TGT&#039;:&#039;C&#039;,    # Cysteine
    &#039;TGA&#039;:&#039;_&#039;,    # Stop
    &#039;TGG&#039;:&#039;W&#039;,    # Tryptophan
}
</pre>
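<p>With the table in place, looking up a codon is a single indexing operation. The minimal sketch below uses a four-codon excerpt of the table (the name <code>mini_code</code> and the <code>*</code> symbol are only for illustration) to show a lookup, a membership test, and the overwrite rule mentioned at the start of this post.</p>
<pre name="code" class="python">
# a four-codon excerpt of the genetic code, for illustration only
mini_code = {'ATG': 'M', 'GCC': 'A', 'TAA': '_', 'TGG': 'W'}

# lookup: index the dictionary with the codon
print(mini_code['ATG'])

# membership test: check that a key exists before using it
print('AT' in mini_code)

# assigning to an existing key overwrites the old value
mini_code['TAA'] = '*'
print(mini_code['TAA'])

# the number of entries has not changed
print(len(mini_code))
</pre>
<p>Run as is, this prints <code>M</code>, <code>False</code>, <code>*</code> and <code>4</code>: the second assignment to <code>'TAA'</code> replaced the old value instead of adding a fifth entry.</p>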
<p>Simple, yet efficient. But this is the type of functionality that would be great to have at hand every time you write a script to translate DNA into proteins, and it is not something that you would like to type (or even copy and paste) all the time. In the next post we will create the translation script and will also create our first Python module.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/04/19/genetic-code-part-i/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

