<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Beginning Python for Bioinformatics &#187; Section 3</title>
	<atom:link href="http://python.genedrift.org/category/section-3/feed/" rel="self" type="application/rss+xml" />
	<link>http://python.genedrift.org</link>
	<description>a step-by-step guide to create Python applications in bioinformatics</description>
	<lastBuildDate>Wed, 10 Mar 2010 13:03:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=3.0-alpha</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Obtaining overrepresented motifs in DNA sequences, part 13</title>
		<link>http://python.genedrift.org/2008/08/20/obtaining-overrepresented-motifs-in-dna-sequences-part-13/</link>
		<comments>http://python.genedrift.org/2008/08/20/obtaining-overrepresented-motifs-in-dna-sequences-part-13/#comments</comments>
		<pubDate>Thu, 21 Aug 2008 02:32:09 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Phase 2]]></category>
		<category><![CDATA[Section 3]]></category>
		<category><![CDATA[Section 5]]></category>
		<category><![CDATA[motifs]]></category>
		<category><![CDATA[defaultdict]]></category>
		<category><![CDATA[determination]]></category>
		<category><![CDATA[dna]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/?p=149</guid>
		<description><![CDATA[Now that we have the best quorum determination function and the ideal function to calculate the binomial expansions, it is easy to write a script that calculates the p value of motifs in DNA sequences. On to the script.
In the code below there are a couple of errors that WordPress doesn&#8217;t let me fix. The &#62; [...]]]></description>
			<content:encoded><![CDATA[<p>Now that we have the best quorum determination function and the ideal function to calculate the <a href="http://en.wikipedia.org/wiki/Binomial_theorem" title="Binomial theorem" rel="wikipedia" class="zem_slink">binomial expansions</a>, it is easy to write a script that calculates the <em>p</em> value of motifs in DNA sequences. On to the script.</p>
<p><em>In the code below there are a couple of errors that WordPress doesn&#8217;t let me fix: the <verbatim>&gt;</verbatim> and <verbatim>&lt;</verbatim> characters are replaced by their literal HTML encoding. I am working on it, sorry.</em></p>
<pre name="code" class="python">
#!/usr/bin/env python

import fasta
import sys
from collections import defaultdict

def choose(n, k):
    if 0 &lt;= k &lt;= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        #print ntok // ktok
        return ntok // ktok
    else:
        return 0

def get_quorums(seqs, mlen):
    &quot;&quot;&quot;
    add seq id_no to a set
    use explicit counter to create seq_no
    &quot;&quot;&quot;
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen + 1):
            quorum[seq[n:n + mlen]].add(id_no)
    return quorum

input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())

foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)

N = len(input_seqs) + len(input_seqs2)

for i in foreground:
    term1 = choose(len(background[i]), len(foreground[i]))
    term2 = choose((N - len(background[i])), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 &lt; p &lt;= 0.0001:
        print i, len(foreground[i]), len(background[i]), p
</pre>
<p>We already defined <code>choose</code> in the last post (more information in the link from the Python Cookbook), and earlier Mike sent us a series of quorum-determination functions; one of the best was presented and explained <a href="http://python.genedrift.org/2008/06/03/obtaining-overrepresented-motifs-in-dna-sequences-part-7/">here</a>. We also need our <code>fasta</code> module to read the sequences (and only the sequences) so that they can be passed to the quorum function.</p>
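<p>To make the quorum idea concrete, here is a minimal sketch on three made-up sequences with words of width 3. It is only an illustration, not the exact function from the earlier post: the toy sequences and the use of <code>enumerate</code> are assumptions, and the loop runs over <code>len(seq) - mlen + 1</code> windows so that the last word of each sequence is included.</p>

```python
from collections import defaultdict


def get_quorums(seqs, mlen):
    """Map each word of length mlen to the set of sequence ids that contain it."""
    quorum = defaultdict(set)
    for id_no, seq in enumerate(seqs, 1):
        # A sequence of length L has L - mlen + 1 words of width mlen.
        for n in range(len(seq) - mlen + 1):
            quorum[seq[n:n + mlen]].add(id_no)
    return quorum


toy_seqs = ["ACGTAC", "CGTTTT", "GGACGT"]  # made-up sequences, not real data
quorum = get_quorums(toy_seqs, 3)

print(sorted(quorum["CGT"]))  # ids of the sequences that contain CGT
print(len(quorum["ACG"]))     # how many sequences contain ACG
```

<p>The size of each set is the quorum of that word, that is, the number of sequences in which the word occurs at least once.</p>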
<p>Basically, we take the foreground and background files as input, determine the quorum of the different words (width 10), and then iterate over the results, calculating the <em>p</em> value for each motif found in the foreground set. The three terms of the hypergeometric distribution are calculated separately, and we print only the motifs whose <em>p</em> value is greater than 0 and at most 0.0001 (this threshold can be modified).</p>
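<p>As a point of reference, the sketch below computes the textbook hypergeometric probability from three binomial coefficients. It is not a line-by-line copy of the script above: the function name <code>hypergeom_pmf</code>, its parameter names and the numbers in the example are assumptions chosen for illustration, and the terms may not match the script one to one.</p>

```python
def choose(n, k):
    """Binomial coefficient C(n, k), computed with integer arithmetic only."""
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    return 0


def hypergeom_pmf(k, N, K, n):
    """Textbook P(X = k): k successes in n draws, without replacement,
    from N items of which K are successes."""
    term1 = choose(K, k)
    term2 = choose(N - K, n - k)
    term3 = choose(N, n)
    return term1 * term2 / float(term3)


# Made-up numbers: 20 sequences in total, 8 of them contain the word,
# 5 sequences are in the foreground and 4 of those contain the word.
p = hypergeom_pmf(4, 20, 8, 5)
print(p)
```

<p>Summing <code>hypergeom_pmf(k, 20, 8, 5)</code> over every possible <code>k</code> gives 1, which is a quick sanity check that the three terms are combined correctly.</p>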
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2008/08/20/obtaining-overrepresented-motifs-in-dna-sequences-part-13/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>End of section 3 and Comments</title>
		<link>http://python.genedrift.org/2007/03/26/end-of-section-3-and-comments/</link>
		<comments>http://python.genedrift.org/2007/03/26/end-of-section-3-and-comments/#comments</comments>
		<pubDate>Tue, 27 Mar 2007 02:51:21 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 3]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/03/26/end-of-section-3-and-comments/</guid>
		<description><![CDATA[With the last post we finished Section 3 of the book. As always, some subjects were not covered, either because Python differs from Perl or because they are not suitable for an introduction to Python.
From now on, the book presents much longer scripts and I will try to follow closely, and include some modifications [...]]]></description>
			<content:encoded><![CDATA[<p>With the last post we finished Section 3 of the book. As always, some subjects were not covered, either because Python differs from Perl or because they are not suitable for an introduction to Python.</p>
<p>From now on, the book presents much longer scripts. I will try to follow it closely and include modifications and updates where I find them appropriate.</p>
<p>Regarding comments, I have been receiving a lot of spam on the <a href="http://blindscientist.genedrift.org" target="_blank">Blind.Scientist</a> blog, so I kept them closed here until I learned a little more about WordPress administration. Please feel free to post your comments and criticism and to point out errors and mistakes.</p>
<p>Thanks.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/03/26/end-of-section-3-and-comments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Command line arguments and a second take on functions</title>
		<link>http://python.genedrift.org/2007/03/26/command-line-arguments-and-a-second-take-on-functions/</link>
		<comments>http://python.genedrift.org/2007/03/26/command-line-arguments-and-a-second-take-on-functions/#comments</comments>
		<pubDate>Mon, 26 Mar 2007 18:18:48 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 3]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/03/26/command-line-arguments-and-a-second-take-on-functions/</guid>
		<description><![CDATA[We have seen, briefly, how to define and use a function in Python. Now we are going to jump forward a bit and create a new function, and at the same time take a look at command line parameters that can be passed to the script.
If you have used command line applications before, you might [...]]]></description>
			<content:encoded><![CDATA[<p>We have seen, briefly, how to define and use a function in Python. Now we are going to jump forward a bit and create a new function, and at the same time take a look at command line parameters that can be passed to the script.</p>
<p>If you have used command line applications before, you might have encountered programs that ask for a file name, a calculation parameter, etc. to be passed on the command line. Python scripts are no different: they accept such parameters. For this we have the <code>sys</code> module, which provides system-specific parameters and functions. We have already used <code>sys.exit</code> from this module. Every operating system (even Windows) supports arguments on its command line, and programming languages usually call such arguments <code>argv</code> (in C/C++, <code>argv</code> is a parameter of the <code>main</code> function). Lists in Python start at 0 (zero), and the first item of the argument list is the script/program name. Basically, if we have this</p>
<p><code>$> python myscript.py DNA.txt</code></p>
<p><code>myscript.py</code> is argument 0 in the list and <code>DNA.txt</code> is argument 1. So whenever we create a script that receives arguments on the command line, we have to start (in most cases, be aware) from 1. In Python, using command line arguments looks like</p>
<pre name="code" class="python">
import sys

filename = sys.argv[1]
valueone = sys.argv[2]
...
</pre>
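<p>The indexing described above can be tried without touching the real command line. The following sketch is an assumption-laden illustration: <code>parse_args</code> is a hypothetical helper, and the list passed to it simulates what the interpreter would store in <code>sys.argv</code> for <code>python myscript.py DNA.txt</code>.</p>

```python
def parse_args(argv):
    """Return (script_name, filename); raise ValueError if the file name is missing."""
    if len(argv) < 2:
        raise ValueError("usage: python %s DNA.txt" % argv[0])
    # argv[0] is the script itself, so the first real argument is argv[1].
    return argv[0], argv[1]


# Simulated argument list; a real script would call parse_args(sys.argv)
# after "import sys".
script_name, filename = parse_args(["myscript.py", "DNA.txt"])
print(script_name)
print(filename)
```

<p>Checking the length of the list first avoids an <code>IndexError</code> when the script is started without a file name.</p>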
<p>We will write a variation of our previous script that counts the bases, now with command line arguments and a function (with no &#8220;error&#8221; checking at first)</p>
<pre name="code" class="python">
#!/usr/bin/env python

import sys

def count_nucleotide_types(seq):
    result = []
    totalA = seq.count(&#039;A&#039;)
    totalC = seq.count(&#039;C&#039;)
    totalG = seq.count(&#039;G&#039;)
    totalT = seq.count(&#039;T&#039;)

    result.append(totalA)
    result.append(totalC)
    result.append(totalG)
    result.append(totalT)

    return result

sequencefile = open(sys.argv[1], &#039;r&#039;).readlines()
sequence = &#039;&#039;.join(sequencefile)
# remove the newline characters left over from the file
sequence = sequence.replace(&#039;\n&#039;, &#039;&#039;)
values = count_nucleotide_types(sequence)
# the list holds the counts in the order A, C, G, T
print &quot;Found &quot; + str(values[0]) + &quot; As&quot;
print &quot;Found &quot; + str(values[1]) + &quot; Cs&quot;
print &quot;Found &quot; + str(values[2]) + &quot; Gs&quot;
print &quot;Found &quot; + str(values[3]) + &quot; Ts&quot;
</pre>
<p>A few new things here. We created a function <code>count_nucleotide_types</code> that should receive a string containing the sequence. The &#8220;real&#8221; first line of the program flow is the one that gets the name of the file from the command line argument, then opens and reads it. We then convert the list to a string, modify it a bit and pass it to the function. We get the result back, and we are done.</p>
<p>With functions we don&#8217;t actually save coding time/length (at least here), but we make our code more organized, easier to read and easier for someone else to understand. It is not good coding practice to have long programs/scripts with no functions, no subdivision, no structure. Functions are often good program nuggets that can be reused in the same application or even ported/copied to other applications and reused indefinitely. Soon we will see a function and class that reads a FASTA file in Python and can be used in any program that needs such a feature. Try the code and come back later for more.</p>
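<p>To illustrate the reuse point, here is a small variation of the counting function that can be called on any string, without a file or command line. Returning a dictionary instead of a list is an assumption made for this sketch, not the approach used in the script above, and the sequence is made up.</p>

```python
def count_nucleotide_types(seq):
    """Return the A, C, G and T counts of seq as a dictionary keyed by base."""
    return {base: seq.count(base) for base in "ACGT"}


counts = count_nucleotide_types("ACGTTTGA")  # made-up sequence
for base in "ACGT":
    print("Found " + str(counts[base]) + " " + base + "s")
```

<p>With a dictionary the caller asks for <code>counts["G"]</code> instead of remembering that G sits at position 2 of a list.</p>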
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/03/26/command-line-arguments-and-a-second-take-on-functions/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Python functions: simple example</title>
		<link>http://python.genedrift.org/2007/03/22/python-functions-simple-example/</link>
		<comments>http://python.genedrift.org/2007/03/22/python-functions-simple-example/#comments</comments>
		<pubDate>Thu, 22 Mar 2007 14:04:10 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 3]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/03/22/python-functions-simple-example/</guid>
		<description><![CDATA[Python subroutines do not exist. Everything is a function, all functions return a value (even if it&#8217;s None), and all functions start with def. This statement is from Dive into Python, a book on Python programming available for free. As mentioned Python functions start with the word def, which is followed by the function name [...]]]></description>
			<content:encoded><![CDATA[<p>Python subroutines do not exist. <a href="http://www.diveintopython.org/getting_to_know_python/declaring_functions.html">Everything is a function, all functions return a value (even if it&#8217;s None), and all functions start with <code>def</code></a>. This statement is from Dive into Python, a book on Python programming available for free. As mentioned, Python functions start with the word <code>def</code>, followed by the function name and then, in parentheses, the arguments the function receives. Something like</p>
<pre name="code" class="python">
def my_first_function(somevalue):
</pre>
<p>Usually Python coders (sometimes called Pythonistas, among other names) follow the Python coding style, which states: <em>Function names should be lowercase, with words separated by underscores as necessary to improve readability</em>. We are going to use this style here whenever a function comes in handy. The parameters passed to the function (<code>somevalue</code> above) do not have a declared datatype; Python accepts whatever is passed. It is therefore up to your code to handle the parameter/value inside the function and avoid errors. Functions follow the same indentation rules as the rest of the code, and the line after the declaration should be indented with four spaces</p>
<pre name="code" class="python">
def my_first_function(somevalue):
    do_something
</pre>
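<p>Because parameters carry no declared datatype, the same function can receive a string, a list or a number. The sketch below shows this, together with one possible way of rejecting unwanted values. Both function names are hypothetical and exist only for this example.</p>

```python
def repeat_twice(somevalue):
    """Return somevalue added to itself, whatever its type happens to be."""
    return somevalue + somevalue


print(repeat_twice("ACGT"))   # strings are concatenated
print(repeat_twice([1, 2]))   # lists are concatenated
print(repeat_twice(3))        # numbers are added


def safe_repeat(somevalue):
    """Same idea, but refuse anything that is not a string."""
    if not isinstance(somevalue, str):
        raise TypeError("expected a string, got %r" % (somevalue,))
    return somevalue + somevalue
```

<p>Whether to check the type explicitly or simply let Python raise an error later is a design choice. For short scripts the unchecked version is often enough.</p>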
<p>So, let&#8217;s warm up with functions. The following script is just the start: it adds a poly-T tail to a DNA sequence. We are going to use our old friend AY162388.seq. I will be back after the script</p>
<pre name="code" class="python">
#! /usr/bin/env python

def add_tail(seq):
    result = seq + &#039;TTTTTTTTTTTTTTTTTTTTT&#039;
    return result

dnafile = &#039;AY162388.seq&#039;
file = open(dnafile, &#039;r&#039;)

sequence = &#039;&#039;
for line in file:
    sequence += line.strip()

print sequence
sequence = add_tail(sequence)
print sequence
</pre>
<p>Not very useful at first sight, but it gives us an impression of what a function looks like. Basically, we define a function <code>add_tail</code> that receives <code>seq</code> as a parameter. Don&#8217;t worry about variable scope now, we will see it later. The rest of the script is just like things we saw before, except for the line <code>sequence = add_tail(sequence)</code>. Here we rebind the name <code>sequence</code> to the return value of the function, so we do not keep a second variable around for the longer string (yep, not much of a saving and not even impressive). Run the script and get ready for the command line arguments.</p>
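<p>A function becomes more useful once its behaviour can be adjusted through parameters. As a minimal sketch, assuming we want to choose the tail length and base, <code>add_tail</code> could take two extra arguments with default values. The default length of 20 and the short stand-in sequence are arbitrary choices for this example.</p>

```python
def add_tail(seq, length=20, base="T"):
    """Return seq followed by `length` copies of `base`."""
    return seq + base * length


sequence = "ACGT"  # stand-in for the contents of AY162388.seq
sequence = add_tail(sequence)
print(sequence)
print(add_tail("ACGT", 5, "A"))
```

<p>Calling <code>add_tail(sequence)</code> with no extra arguments still adds a poly-T tail, while the second call shows a short poly-A tail.</p>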
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/03/22/python-functions-simple-example/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Debugging in Python</title>
		<link>http://python.genedrift.org/2007/03/15/debugging-in-python/</link>
		<comments>http://python.genedrift.org/2007/03/15/debugging-in-python/#comments</comments>
		<pubDate>Thu, 15 Mar 2007 14:43:10 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 3]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/03/15/debugging-in-python/</guid>
		<description><![CDATA[Beginning the third section of our tutorial/guide, we are going to look at chapter six of BPB. This chapter discusses creating subroutines (functions, in Python&#8217;s case) and debugging code.
We are going to start by the end. In Python, code debugging can be done as in any other programming language: Perl has [...]]]></description>
			<content:encoded><![CDATA[<p>Beginning the third section of our tutorial/guide, we are going to look at chapter six of Beginning Perl for Bioinformatics (BPB). This chapter discusses creating subroutines (functions, in Python&#8217;s case) and debugging code.</p>
<p>We are going to start at the end. In Python, code debugging can be done as in any other programming language: Perl has its built-in debugger (<code>perl -d</code>), C/C++ has gdb, etc. Python has the <code>pdb</code> module, which can be imported and run to check for errors in your code. Using this command line:</p>
<p><code>python -m pdb myscript</code></p>
<p>will start the debug module, which in turn runs your script. If you are an experienced programmer who is just starting with Python, pdb usage might look simple and straightforward. On the other hand, if you don&#8217;t have a lot of programming experience, I would suggest a different approach until you become more comfortable with the language. Python has a great advantage over some other interpreted languages: it lets you code interactively in the interpreter. So if your code is not working properly, for example producing a wrong output or a value that is not being calculated correctly, you can either type the failing part of your script into the interpreter or use the first rule of debugging: include <code>print</code> statements that output the value of variables/objects.</p>
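<p>As a small illustration of the <code>print</code> approach, the sketch below prints the intermediate values of a calculation before returning the result. The function <code>gc_content</code> and the sequence are made up for this example. The commented line shows where <code>pdb.set_trace()</code> could be placed to stop and inspect the same values interactively instead.</p>

```python
def gc_content(seq):
    """Return the fraction of G and C bases in seq."""
    g = seq.count("G")
    c = seq.count("C")
    # Temporary debugging output: show the intermediate values.
    print("DEBUG g=%d c=%d len=%d" % (g, c, len(seq)))
    # import pdb; pdb.set_trace()  # uncomment to pause here in the debugger
    return (g + c) / float(len(seq))


print(gc_content("GGCCAT"))
```

<p>Once the function behaves as expected, the debugging line can simply be deleted.</p>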
<p>Another option is to use a Python code editor, which will also help by highlighting your code. I have little experience with Python code editors, as I normally code in Linux and use <a href="http://kate-editor.org/">Kate</a>. Lately I have been trying <a href="http://www.activestate.com/products/komodo_edit/">Komodo Edit</a>, a cross-platform freeware editor from ActiveState. It looks pretty good, but I have never tried debugging my code with it.</p>
<p>So, this is my advice if you are just starting to program. Maybe because of the age of Beginning Perl for Bioinformatics (published in 2001), Perl&#8217;s debugger was the only option back then. Thanks to major advances in open-source and free software, there are many other options nowadays to debug your code.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/03/15/debugging-in-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

