<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Beginning Python for Bioinformatics &#187; Section 1</title>
	<atom:link href="http://python.genedrift.org/category/section-1/feed/" rel="self" type="application/rss+xml" />
	<link>http://python.genedrift.org</link>
	<description>a step-by-step guide to create Python applications in bioinformatics</description>
	<lastBuildDate>Thu, 20 May 2010 21:34:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1-alpha</generator>
		<item>
		<title>&#8220;Manipulating&#8221; Python lists II</title>
		<link>http://python.genedrift.org/2007/02/12/manipulating-python-lists-ii/</link>
		<comments>http://python.genedrift.org/2007/02/12/manipulating-python-lists-ii/#comments</comments>
		<pubDate>Mon, 12 Feb 2007 19:18:03 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/02/12/manipulating-python-lists-ii/</guid>
		<description><![CDATA[As mentioned we will see in this entry some other features of Python lists. We will start with a similar example to the one in the book and then use our DNA file. So let&#8217;s assume we have this simple list nucleotides = [ &#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;] If we print it directly we would [...]]]></description>
			<content:encoded><![CDATA[<p>As mentioned we will see in this entry some other features of Python lists. We will start with a similar example to the one in the book and then use our DNA file. So let&#8217;s assume we have this simple list</p>
<pre name="code" class="python">
nucleotides =  [ &#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;]
</pre>
<p>If we print it directly we would get something like this</p>
<pre name="code" class="python">
[&#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;]
</pre>
<p>which is fine for now, as we are not worried (yet) about the output (we will deal with that further below). Let&#8217;s remove the last nucleotide. To accomplish that, we use <code>pop</code> with no specific index</p>
<pre name="code" class="python">
nucleotides.pop()
</pre>
<p>which gives me this when printed</p>
<pre name="code" class="python">
[&#039;A&#039;, &#039;C&#039;, &#039;G&#039;]
</pre>
<p>Remember that lists are mutable, so the list itself is changed; <code>pop</code> returns the removed item, which is lost unless you assign it to a variable. We can also remove any other item in the list, let&#8217;s say &#8216;C&#8217;. First, we reassign the original list items and then remove the second item</p>
<pre name="code" class="python">
nucleotides = [ &#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;]
nucleotides.pop(1)
</pre>
<p>The list when printed will return this</p>
<pre name="code" class="python">
[&#039;A&#039;, &#039;G&#039;, &#039;T&#039;]
</pre>
<p><code>pop</code> accepts any valid index of the list. Any index equal to or larger than the length of the list will raise an <code>IndexError</code>. For future reference, remember that when any item is removed (or inserted) the indexes of the following items change, and so does the length. It may seem obvious, but mistakes are common.</p>
<p>Shifting from our &#8216;destructive&#8217; mode, we can also add elements to the list. Adding to the end of the list is trivial, by using <code>append</code></p>
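<p>A minimal sketch of the behaviour described above, written for Python 3 (<code>print()</code> as a function) rather than the Python 2 <code>print</code> statement used in these entries. The variable names <code>last</code>, <code>second</code> and <code>error_raised</code> are only illustrative.</p>

```python
# Python 3 syntax; the list mirrors the one used in this entry.
nucleotides = ['A', 'C', 'G', 'T']

last = nucleotides.pop()      # removes and returns the last item
second = nucleotides.pop(1)   # removes and returns the item at index 1

# After two removals the list holds two items, so index 2 no longer exists.
try:
    nucleotides.pop(2)
    error_raised = False
except IndexError:
    error_raised = True

print(last, second, nucleotides, error_raised)
```

<p>Keeping the value returned by <code>pop</code> is optional; if it is not assigned to a variable, the removed item is simply discarded.</p>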
<pre name="code" class="python">
nucleotides = [ &#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;]
nucleotides.append(&#039;A&#039;)
</pre>
<p>after which the list, when printed, contains</p>
<pre name="code" class="python">
[&#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;, &#039;A&#039;]
</pre>
<p>Adding to any position is also very straightforward with <code>insert</code>, like this</p>
<pre name="code" class="python">
nucleotides = [ &#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;]
nucleotides.insert(0, &#039;A&#039;)
</pre>
<p>where <code>insert</code> takes two arguments: first is the index of the element before which to insert and second the element to be inserted. So our line above will insert an &#8216;A&#8217; just before the &#8216;A&#8217; at position zero. We can try this</p>
<pre name="code" class="python">
nucleotides = [ &#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;]
nucleotides.insert(0, &#039;A1&#039;)
nucleotides.insert(2, &#039;C1&#039;)
nucleotides.insert(4, &#039;G1&#039;)
nucleotides.insert(6, &#039;T1&#039;)
</pre>
<p>that will result in</p>
<pre name="code" class="python">
[&#039;A1&#039;, &#039;A&#039;, &#039;C1&#039;, &#039;C&#039;, &#039;G1&#039;, &#039;G&#039;, &#039;T1&#039;, &#039;T&#039;]
</pre>
<p>Notice that we add every new item at an even position, due to the fact that for every insertion the list&#8217;s length and indexes change.</p>
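<p>To see why the even positions work, here is a small sketch (Python 3 syntax) that records the list after each insertion. The <code>steps</code> list is a hypothetical helper that is not part of the original example.</p>

```python
nucleotides = ['A', 'C', 'G', 'T']
steps = []  # hypothetical helper list that records a snapshot after each insert

for index, label in [(0, 'A1'), (2, 'C1'), (4, 'G1'), (6, 'T1')]:
    nucleotides.insert(index, label)
    steps.append(list(nucleotides))  # copy, so later inserts do not alter it

for snapshot in steps:
    print(len(snapshot), snapshot)
```

<p>Each insertion makes the list one item longer and shifts everything after the new item one position to the right, which is why the next target index has to grow by two.</p>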
<p>And lastly, we will take care of the output. Of course, if we are creating a script that requires a nicer output, printing a list is not the best way. We could create a loop and merge all entries in the list, but that would take a couple of lines and we ought to have an easier way (otherwise we could be using C++ instead). There is a way, by using the method <code>join</code>. This method will join all the elements in a list into a single string, with a selected delimiter.</p>
<pre name="code" class="python">
nucleotides = [ &#039;A&#039;, &#039;C&#039;, &#039;G&#039;, &#039;T&#039;]
&quot;&quot;.join(nucleotides)
</pre>
<p>will generate this output</p>
<pre name="code" class="python">
ACGT
</pre>
<p><code>join</code> is a method of strings. The string it is called on is the delimiter, which could be anything (in our case it is an empty one). The line of code tells Python to take the empty string and use it to join the list of strings that we call nucleotides.</p>
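<p>As a short illustration of the delimiter, the sketch below (Python 3 syntax) joins the same list with three different strings. The names <code>plain</code>, <code>dashed</code> and <code>spaced</code> are only illustrative.</p>

```python
nucleotides = ['A', 'C', 'G', 'T']

plain = "".join(nucleotides)     # empty delimiter
dashed = "-".join(nucleotides)   # single-character delimiter
spaced = ", ".join(nucleotides)  # multi-character delimiter

print(plain)
print(dashed)
print(spaced)
```

<p>Note that <code>join</code> expects every element of the list to be a string already.</p>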
<p>With this we finish the first section of the site and we are moving to chapter 5 in the book.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/02/12/manipulating-python-lists-ii/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>&#8220;Manipulating&#8221; Python lists I</title>
		<link>http://python.genedrift.org/2007/02/09/manipulating-python-lists-i/</link>
		<comments>http://python.genedrift.org/2007/02/09/manipulating-python-lists-i/#comments</comments>
		<pubDate>Fri, 09 Feb 2007 19:01:39 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/02/09/manipulating-python-lists-i/</guid>
		<description><![CDATA[Now, I want to manipulate my DNA sequence, extract some nucleotides, check lines, etc. Simple things. We start with the same basic code to read the file: #! /usr/bin/env python dnafile = &#34;AY162388.seq&#34; file = open(dnafile, &#039;r&#039;).readlines() Our nucleotides are stored in the variable file. Remember that each line is one item of the list [...]]]></description>
			<content:encoded><![CDATA[<p>Now, I want to manipulate my DNA sequence, extract some nucleotides, check lines, etc. Simple things. We start with the same basic code to read the file:</p>
<pre name="code" class="python">
#! /usr/bin/env python
dnafile = &quot;AY162388.seq&quot;
file = open(dnafile, &#039;r&#039;).readlines()
</pre>
<p>Our nucleotides are stored in the variable <code>file</code>. Remember that each line is one item of the list and the lines still contain the newline character present in the ASCII file. Let&#8217;s get the first and the last lines of the sequence. The first line is easy to get, as Python&#8217;s lists start at 0. To access one list item just add square brackets with the index number of the item you want to get (this is known as indexing; slicing, which extracts a range of items, uses the same brackets). Something like this</p>
<pre name="code" class="python">
file[0]
</pre>
<p>will return item 0 from the list, which in our case is the first line of the sequence. If you add a <code>print</code> command</p>
<pre name="code" class="python">
print file[0]
</pre>
<p>you should expect</p>
<p><code>GTGACTTTGTTCAACGGCC....CGTAATCACTTGTTC</code></p>
<p>The last line is a little bit trickier. Let&#8217;s assume that we don&#8217;t know the number of lines in the list, and here we want to make our script as general as possible, so it can handle some simple files later. It is also good coding practice to think ahead and plan what you want: first, it gives you a detailed project to follow, and second, it prepares you for errors/bugs that your code might have or for situations not expected in your original plan.</p>
<p>In our file, we have <strong>eight</strong> lines of DNA, so it would be just a matter of adding <code>print file[<strong>7</strong>]</code> and we would output the last line. But the right way to do it is to check the length of the list and work out the index of the last item from it. In Python, you can check the length of a list by passing it to the built-in function <code>len</code>, like this</p>
<pre name="code" class="python">
len(file)
</pre>
<p>So how do we print the last line of our sequence? Simple. <code>len(file)</code> should return an integer of value 8, which is the actual number of elements in our list. We already know that to access any item in a list we just add its index (that has to be an integer) to the list name. One idea then would be to use <code>len(file)</code> as the index, like this</p>
<pre name="code" class="python">
print file[len(file)]
</pre>
<p>Why would that be wrong? Our list has eight items, but the indexes go from 0 to <strong>7</strong>. So eight would be one past the last valid index, which is not accessible because it does not exist, and Python raises an <code>IndexError</code>. Solution? Let&#8217;s use the list length minus one:</p>
<pre name="code" class="python">
print file[len(file)-1]
</pre>
<p>and there you are, the last line of the sequence. But as all we want is the last item of a list, there is an easier way to output it:</p>
<pre name="code" class="python">
print file[-1]
</pre>
<p>which tells the interpreter to print the last item of a list.</p>
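<p>The three ways of reaching the last item can be compared side by side. This is a minimal sketch in Python 3 syntax; the short <code>lines</code> list is a made-up stand-in for the list returned by <code>readlines()</code>.</p>

```python
# Hypothetical stand-in for the list returned by readlines().
lines = ['GTGA\n', 'TTTA\n', 'AGTG\n']

first = lines[0]
last_by_length = lines[len(lines) - 1]   # length minus one
last_by_negative = lines[-1]             # negative indexes count from the end

# Using the length itself as the index points one past the last item.
try:
    lines[len(lines)]
    out_of_range = False
except IndexError:
    out_of_range = True

print(repr(first), repr(last_by_negative), out_of_range)
```

<p>Negative indexes work the same way for any position: <code>-2</code> is the second item from the end, and so on.</p>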
<p>Putting everything together, we have</p>
<pre name="code" class="python">
#! /usr/bin/env python
dnafile = &quot;AY162388.seq&quot;
file = open(dnafile, &#039;r&#039;).readlines()
print &#039;I want the first line&#039;
print file[0]
print &#039;now the last line&#039;
print file[-1]
</pre>
<p>Two points worth mentioning: unlike strings, Python&#8217;s lists are mutable, so items can be added, removed and changed; and strings can also be indexed and sliced, with indexes that access individual characters.</p>
<p>Next we will see some more features of lists and strings, and how to manipulate them. It will probably be the last entry in the first section as we finish Chapter 4 in the book. As you may have noticed, some items in the Perl book will not be covered, at least not immediately. We will jump back and forth sometimes.</p>
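<p>A small sketch of the two points above, in Python 3 syntax. The variable names are illustrative only.</p>

```python
nucleotides = ['A', 'C', 'G', 'T']
nucleotides[0] = 'U'          # lists can be changed in place

dna = 'ACGT'
first_two = dna[0:2]          # slicing a string returns a new string

try:
    dna[0] = 'U'              # strings cannot be changed in place
    string_changed = True
except TypeError:
    string_changed = False

print(nucleotides, first_two, dna, string_changed)
```

<p>To &#8220;change&#8221; a string you build a new one, which is what string methods do behind the scenes.</p>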
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/02/09/manipulating-python-lists-i/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reading files in Python: using lists</title>
		<link>http://python.genedrift.org/2007/02/07/reading-files-in-python-using-lists/</link>
		<comments>http://python.genedrift.org/2007/02/07/reading-files-in-python-using-lists/#comments</comments>
		<pubDate>Wed, 07 Feb 2007 18:56:59 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[list]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[read files]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/02/07/reading-files-in-python-using-lists/</guid>
		<description><![CDATA[Let&#8217;s improve our previous script and put the contents of the file in a variable similar to an array. Python understands different formats of compound data types, and list is the most versatile. A list in Python can be assigned by a series of elements (or values) separated by a comma and surrounded by square [...]]]></description>
			<content:encoded><![CDATA[<p>Let&#8217;s improve our previous script and put the contents of the file in a variable similar to an array. Python understands different formats of compound data types, and <code>list</code> is the most versatile. A <code>list</code> in Python can be assigned by a series of elements (or values) separated by a comma and surrounded by square brackets</p>
<pre name="code" class="python">
shoplist = [&#039;milk&#039;, 1, &#039;lettuce&#039;, 2, &#039;coffee&#039;, 3]
</pre>
<p>Now, we are going to read the same file and store the DNA sequence in a <code>list</code> and output this variable. The beginning of the script is the same, where we basically tell Python that the file name is AY162388.seq.</p>
<pre name="code" class="python">
#! /usr/bin/env python
dnafile = &quot;AY162388.seq&quot;
</pre>
<p>We are going to change the way we read the file. Instead of just opening and then reading line-by-line, we are going to open it and read all the lines at once, by using this</p>
<pre name="code" class="python">
file = open(dnafile, &#039;r&#039;).readlines()
</pre>
<p>Notice the <code>.readlines()</code> call at the end? In the previous script, we opened the file and stored it in a <code>file object</code>. Now, we are opening the file and, just after it is opened, reading all of its lines at once and storing them in <code>file</code>, which is no longer a <code>file object</code> but a Python <code>list</code> of strings.</p>
<p>Before, if we wanted to manipulate our DNA sequence, we would have had to read it and then, in the loop, store it in a variable of our choice. In this script, we do all that at once, and the result is a variable that we can change the way we want. The code without the output part is</p>
<pre name="code" class="python">
#! /usr/bin/env python
dnafile = &quot;AY162388.seq&quot;
file = open(dnafile, &#039;r&#039;).readlines()
</pre>
<p>Try putting a <code>print</code> statement after the last line to print the <code>file</code> list. You will get something like this</p>
<p><code>['GTGACTT...TTGTTC\n', 'TTTAAATA....TAATC\n', 'AGTGA...CTATG\n', 'GAGCTCA....TATAGC\n', ...]</code></p>
<p>which is exactly the description of a Python list. You see all lines, separated by commas and surrounded by square brackets. Notice that each line has a newline (<code>\n</code>) character at the end.</p>
<p>Let&#8217;s make the output a little nicer by including a loop. Remember that when I introduced loops I wrote that Python iterates over &#8220;items in a sequence of items&#8221;, which is a good description of a <code>list</code>. So the loop should be as straightforward as</p>
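<p>One possible way to deal with those trailing newlines is sketched below in Python 3 syntax. It is an assumption about where the cleanup could go, not the approach taken later on this site, and <code>io.StringIO</code> is used only as a stand-in for the real file so the example runs on its own.</p>

```python
import io

# io.StringIO is used as a hypothetical stand-in for open(dnafile, 'r').
handle = io.StringIO("GTGACTT\nTTTAAATA\nAGTGA\n")
lines = handle.readlines()

# rstrip('\n') returns a copy of each line without the trailing newline.
clean = [line.rstrip('\n') for line in lines]

print(lines)
print(clean)
```

<p>The original <code>lines</code> list is left untouched; <code>clean</code> is a new list built from it.</p>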
<pre name="code" class="python">
for line in file:
    print line
</pre>
<p>Putting everything together gives us</p>
<pre name="code" class="python">
#! /usr/bin/env python
dnafile = &quot;AY162388.seq&quot;
file = open(dnafile, &#039;r&#039;).readlines()
print file
for line in file:
    print line
</pre>
<p>that can be downloaded <a href="http://python.genedrift.org/codepy/code_05.py">here</a>. Next we will work on improving the output again and maybe modify/convert the <code>list</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/02/07/reading-files-in-python-using-lists/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reading files in Python</title>
		<link>http://python.genedrift.org/2007/01/30/reading-files-in-python/</link>
		<comments>http://python.genedrift.org/2007/01/30/reading-files-in-python/#comments</comments>
		<pubDate>Tue, 30 Jan 2007 18:53:56 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[read files]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/01/30/reading-files-in-python/</guid>
		<description><![CDATA[We are currently following Chapter 4 of Beginning Perl for Bioinformatics, which is the first chapter of the book that actually has code snippets and real programming. The last exercises in this chapter deal with the ability to read files and operate with information extracted from these files, to create arrays and scalar list in [...]]]></description>
			<content:encoded><![CDATA[<p>We are currently following Chapter 4 of Beginning Perl for Bioinformatics, which is the first chapter of the book that actually has code snippets and real programming. The last exercises in this chapter deal with the ability to read files and operate with information extracted from these files, to create arrays and scalar lists in Perl.</p>
<p>We are going to check how to read files in Python. The book tells you how to read protein sequences. Here we are going to read DNA and protein sequences from files and change them.</p>
<p>Let&#8217;s say you have a file with a DNA sequence in some directory on your hard disk. The file should not be in FASTA format (we will see later how to handle FASTA files), just pure sequence, something like this:</p>
<p><code>GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC<br />
TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC<br />
AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG<br />
GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC<br />
ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA<br />
AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC<br />
AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA<br />
TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG</code></p>
<p>You can download the file <a href="http://python.genedrift.org/codepy/AY162388.seq">here</a>. This is a partial sequence of a mitochondrial gene from a South American frog species called <a href="http://calphotos.berkeley.edu/imgs/128x192/0000_0000/1101/0201.jpeg"><em>Hylodes ornatus</em></a>. For our purposes you can save the file in the same directory you are going to run the script from, or if you are using the Python interpreter start it in the directory that contains the file.</p>
<p>The file name is not important but we will use AY162388.seq from now on. The first thing we have to do is to open the file for reading. We define a string variable that will contain the file name.</p>
<pre name="code" class="python">
dnafile = &quot;AY162388.seq&quot;
</pre>
<p>In order to open the file, we can use the command <code>open</code>, that receives two strings: the first is the file name (it can be the full path too) to be opened and the second is the <em>mode</em> to be used, which is what you want to do with the file. The <em>mode</em> can be one or more letters that tell the interpreter what to do. For now we are going to use the <code>r</code> <em>mode</em>, which tells Python to read the file, and only do that. So our code is</p>
<pre name="code" class="python">
file = open(dnafile, &#039;r&#039;)
</pre>
<p><code>file</code> is a file object that gives us access to our DNA sequence file. At this point we have only opened the file; its contents have not been read yet. We can read them in a number of ways. We will start with the most common one: reading the file line by line.</p>
<p><code>file</code> is our file object. We have just opened it, but Python already knows that any file contains lines (remember that this is a regular ASCII file). We are going to use a loop to read each line of the file, one by one</p>
<pre name="code" class="python">
for line in file:
    print line
</pre>
<p>In Python, the <code>for</code> loop/statement iterates over items in a sequence of items, usually a string or a list (we will see Python&#8217;s <code>list</code> soon), instead of iterating over a progression of numbers. Basically, our <code>for</code> above will iterate over each line in the file until EOF (end-of-file) is reached. Our simple script to read a DNA sequence from a file and output to the screen is</p>
<pre name="code" class="python">
#! /usr/bin/env python
dnafile = &quot;AY162388.seq&quot;
file = open(dnafile, &#039;r&#039;)
for line in file:
    print line
</pre>
<p>You can download the above script <a href="http://python.genedrift.org/codepy/code_04.py">here</a>. To run it have the AY162388.seq in the same directory.</p>
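<p>For readers on Python 3, the same loop might look like the sketch below. It writes a short, made-up sequence to a temporary file so that it runs without AY162388.seq, and it uses a <code>with</code> block, which is an addition to the original script that closes the file automatically. The names <code>path</code>, <code>handle</code> and <code>collected</code> are illustrative.</p>

```python
import os
import tempfile

# Write a short, made-up sequence to a temporary file so the example is
# self-contained; in the post the file would be AY162388.seq.
with tempfile.NamedTemporaryFile('w', suffix='.seq', delete=False) as tmp:
    tmp.write("GTGACTTTGT\nTTTAAATAAG\n")
    path = tmp.name

collected = []
with open(path, 'r') as handle:   # the file is closed when the block ends
    for line in handle:
        collected.append(line)
        print(line, end='')       # each line already ends with a newline

os.remove(path)                   # clean up the temporary file
```

<p>Passing <code>end=''</code> to <code>print()</code> avoids the blank line between sequence lines that appears when the newline from the file and the one added by printing are both output.</p>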
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/01/30/reading-files-in-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Transcribing: the &#8220;other&#8221; way</title>
		<link>http://python.genedrift.org/2007/01/23/transcribing-the-other-way/</link>
		<comments>http://python.genedrift.org/2007/01/23/transcribing-the-other-way/#comments</comments>
		<pubDate>Tue, 23 Jan 2007 18:52:30 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[transcribe DNA]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/01/23/transcribing-the-other-way/</guid>
		<description><![CDATA[We have seen how to transcribe DNA using regular expression, even though the regex we used cannot be considered a real one. Now we are going to simplify our small script even more and take advantage of some string capabilities of Python. Instead of using two lines, we are going to use only one. And [...]]]></description>
			<content:encoded><![CDATA[<p>We have seen how to transcribe DNA using regular expression, even though the regex we used cannot be considered a real one. Now we are going to simplify our small script even more and take advantage of some string capabilities of Python. Instead of using two lines, we are going to use only one. And also we won&#8217;t need to import anything.</p>
<p>Let&#8217;s start again with the same DNA sequence</p>
<pre name="code" class="python">
myDNA = &#039;ACGTTGCAACGTTGCAACGTTGCA&#039;
</pre>
<p>This time we are going to use <code>replace</code>. This is one of the Python&#8217;s methods to manipulate strings. Basically, we are asking the interpreter to replace a certain string by another. The method returns a new copy of your string:</p>
<pre name="code" class="python">
myRNA = myDNA.replace(&#039;T&#039;, &#039;U&#039;)
</pre>
<p>This tells Python: <code>myRNA</code> will receive a copy of <code>myDNA</code> where all Ts were replaced by Us. The &#8220;dot&#8221; after <code>myDNA</code> means that the method <code>replace</code> is called on that variable and uses its contents as input.</p>
<p>So our code from above would look like this</p>
<pre name="code" class="python">
#! /usr/bin/env python

myDNA = &#039;ACGTTGCAACGTTGCAACGTTGCA&#039;
myRNA = myDNA.replace(&#039;T&#039;, &#039;U&#039;)
print myRNA
</pre>
<p>Simple and efficient. Next we will use the same approach on generating the reverse complement of a DNA sequence, with no regex pattern.</p>
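<p>Because <code>replace</code> returns a new copy, the original string is left untouched. The sketch below (Python 3 syntax) checks that, and also shows the optional third argument of <code>replace</code>, which limits the number of replacements; <code>first_only</code> is an illustrative name.</p>

```python
myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
myRNA = myDNA.replace('T', 'U')

# The optional third argument limits how many occurrences are replaced.
first_only = myDNA.replace('T', 'U', 1)

print(myDNA)       # still the original sequence
print(myRNA)
print(first_only)
```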
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/01/23/transcribing-the-other-way/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Regular Expression</title>
		<link>http://python.genedrift.org/2007/01/16/the-regular-expression/</link>
		<comments>http://python.genedrift.org/2007/01/16/the-regular-expression/#comments</comments>
		<pubDate>Tue, 16 Jan 2007 18:49:07 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regular expression]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/01/16/the-regular-expression/</guid>
		<description><![CDATA[As mentioned above, regex in Python are provided by the re module, which provides an interface for the regular expression engine. First thing we have to do is to tell the interpreter what to do and what expression to use. Let&#8217;s start with a DNA sequence. myDNA = &#039;ACGTTGCAACGTTGCAACGTTGCA&#039; How to transcribe it to RNA? [...]]]></description>
			<content:encoded><![CDATA[<p>As mentioned above, regex in Python are provided by the <code>re</code> module, which provides an interface for the regular expression engine. First thing we have to do is to tell the interpreter what to do and what expression to use.</p>
<p>Let&#8217;s start with a DNA sequence.</p>
<pre name="code" class="python">
myDNA = &#039;ACGTTGCAACGTTGCAACGTTGCA&#039;
</pre>
<p>How to transcribe it to RNA? Transcription creates a single-strand RNA molecule from the double-strand DNA; basically the final result is a similar sequence, with all <code>T</code>&#8216;s changed to <code>U</code>&#8216;s. So our regular expression has to find all <code>T</code> nucleotides in the above sequence and then replace them.</p>
<p>Regular expressions in Python are usually compiled into a RegexObject, which provides the regular expression operations as methods. In our case we need to search and replace, which can be done by using the <code>sub()</code> method. According to Python&#8217;s <a href="http://www.amk.ca/python/howto/regex/regex.html">Regular Expression HOWTO</a>, <code>sub()</code> returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in <code>string</code> by the replacement <code>replacement</code>. If the pattern isn&#8217;t found, <code>string</code> is returned unchanged.</p>
<p>Let&#8217;s put everything above in real code. First we need to compile a new RegexObject that will search for all thymines in our sequence. It can be achieved by using this:</p>
<pre name="code" class="python">
regexp = re.compile(&#039;T&#039;)
</pre>
<p>Simple as that. This line of code tells the Python interpreter that our &#8220;regular expression&#8221; is every T in our string. Now, we have to replace those Ts with Us. In order to do that we just tell the interpreter:</p>
<pre name="code" class="python">
myRNA = regexp.sub(&#039;U&#039;, myDNA)
</pre>
<p>Let&#8217;s look at the last two lines of code. On the first line we created a new RegexObject, <code>regexp</code> (that could have any name, as any variable) and compiled it, making our regular expression match every T in our string. On the second line, we assigned our soon-to-be-created RNA sequence to a new string (remember that strings in Python are immutable) and used the method <code>sub</code> to replace the Ts present in our original DNA string with Us. Putting it all together, our transcription code will be</p>
<pre name="code" class="python">
#! /usr/bin/env python

import re
myDNA = &#039;ACGTTGCAACGTTGCAACGTTGCA&#039;
regexp = re.compile(&#039;T&#039;)
myRNA = regexp.sub(&#039;U&#039;, myDNA)
print myRNA
</pre>
<p>You can download the resulting script <a href="http://python.genedrift.org/codepy/code_03.py">here</a>.</p>
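<p>The behaviour of <code>sub</code> described above can be checked with a short sketch in Python 3 syntax. The names <code>unchanged</code> and <code>shortcut</code> are illustrative, and the last line uses the module-level <code>re.sub</code> function, which the original script does not use.</p>

```python
import re

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
regexp = re.compile('T')

myRNA = regexp.sub('U', myDNA)           # every match replaced
unchanged = regexp.sub('U', 'ACGACG')    # no T present, string returned as is
shortcut = re.sub('T', 'U', myDNA)       # module-level function, same result

print(myRNA)
print(unchanged)
```

<p>Compiling first is mostly useful when the same pattern is applied many times; for a single substitution the module-level function gives the same result.</p>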
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/01/16/the-regular-expression/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Python printing statement</title>
		<link>http://python.genedrift.org/2007/01/15/python-printing-statement/</link>
		<comments>http://python.genedrift.org/2007/01/15/python-printing-statement/#comments</comments>
		<pubDate>Mon, 15 Jan 2007 23:40:41 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[print]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/03/07/python-printing-statement/</guid>
		<description><![CDATA[Just an apart from the bioinformatics aspect of programming: Python&#8217;s print statement. As in most computer languages Python allows an easy way to write to the standard output. Python&#8217;s print only accepts output of strings, and if the variable sent to it is not a string it is first converted and then output. The print [...]]]></description>
			<content:encoded><![CDATA[<p>Just an aside from the bioinformatics aspect of programming: Python&#8217;s <code>print</code> statement.</p>
<p>As in most computer languages Python allows an easy way to write to the standard output. Python&#8217;s <code>print</code> only accepts output of strings, and if the variable sent to it is not a string it is first converted and then output.</p>
<p><code>print</code> always puts a line break (<code>'\n'</code> or <code>"\n"</code>) at the end of the expression to be output, except when the <code>print</code> statement ends with a comma. For example:</p>
<pre name="code" class="python">
print &quot;This is a&quot;
print &quot;test&quot;
</pre>
<p>will print</p>
<p><strong>This is a</strong></p>
<p><strong>test</strong></p>
<p>while</p>
<pre name="code" class="python">
print &quot;This is a&quot;,
print &quot;test&quot;,
</pre>
<p>will print</p>
<p><strong>This is a test</strong></p>
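<p>The trailing comma belongs to the Python 2 <code>print</code> statement used here. As a hedged aside, in Python 3, where <code>print()</code> is a function, the closest counterpart is the <code>end</code> argument. The sketch below captures the output in a buffer only so that it can be inspected; <code>buffer</code> and <code>output</code> are illustrative names.</p>

```python
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("This is a", end=" ")   # Python 3 counterpart of the trailing comma
    print("test")

output = buffer.getvalue()
print(repr(output))
```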
<p>Of course Python&#8217;s <code>print</code> statement accepts the usual escape characters, such as <code>'\n'</code> and <code>'\t'</code>.</p>
<p><strong>Concatenating strings on output</strong></p>
<p>To concatenate two strings on output there are two possible ways in Python. You can either separate the strings with a comma, like we did here</p>
<pre name="code" class="python">
print myDNA, myDNA2
</pre>
<p>or you can use the &#8220;+&#8221; sign in order to obtain almost the same result. This is similar to what was used here</p>
<pre name="code" class="python">
myDNA3 = myDNA + myDNA2
</pre>
<p>but instead we would use the <code>print</code> command as</p>
<pre name="code" class="python">
print myDNA3 + myDNA
</pre>
<p>In the latter case, the strings will not be separated by a space and will be merged. To get the same result as with the comma, you would have to concatenate an extra space between the strings like</p>
<pre name="code" class="python">
print myDNA3 + &quot; &quot; + myDNA
</pre>
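<p>The difference between the two approaches can be seen by building the strings first. This is a minimal sketch in Python 3 syntax; the short sequences and the names <code>merged</code>, <code>spaced</code> and <code>also_spaced</code> are made up for the example.</p>

```python
myDNA = 'ACGT'
myDNA2 = 'TGCA'   # hypothetical second sequence

merged = myDNA + myDNA2                    # no separator
spaced = myDNA + " " + myDNA2              # explicit space
also_spaced = " ".join([myDNA, myDNA2])    # same result using join

print(merged)
print(spaced)
```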
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/01/15/python-printing-statement/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Importing and regular expressions</title>
		<link>http://python.genedrift.org/2007/01/15/importing-and-regular-expressions/</link>
		<comments>http://python.genedrift.org/2007/01/15/importing-and-regular-expressions/#comments</comments>
		<pubDate>Mon, 15 Jan 2007 18:47:54 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[import]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/03/08/importing-and-regular-expressions/</guid>
		<description><![CDATA[Tisdall&#8217;s book on Perl introduces next the ability to transcripts DNA sequences into RNA. In order to do that we need to check a different aspect of programming: regular expressions (or regex). Regular expressions is a pattern/string expression that works matching/describing/filtering other strings. Let&#8217;s say you want to examine or extract all vowels contained in [...]]]></description>
			<content:encoded><![CDATA[<p>Tisdall&#8217;s book on Perl introduces next the ability to transcripts DNA sequences into RNA. In order to do that we need to check a different aspect of programming: regular expressions (or regex). Regular expressions is a pattern/string expression that works matching/describing/filtering other strings.</p>
<p>Let&#8217;s say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all html tags from a downloaded webpage. As HTML tags are encapsulated between <code><</code> and <code>></code> signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the internet (such as <a href="http://www.regular-expressions.info/"> this one</a>).</p>
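<p>Both examples can be sketched with the <code>re</code> module, which is introduced just below. This is only an illustration in Python 3 syntax: the <code>page</code> fragment is made up, and the tag pattern is a simple assumption that would not cope with every real web page.</p>

```python
import re

# Extract all vowels from a word.
vowels = re.findall(r'[aeiou]', 'bioinformatics')

page = '<p>Hello <b>world</b></p>'   # made-up HTML fragment

# '<' followed by one or more characters that are not '>', then '>'.
tag = re.compile(r'<[^>]+>')
text = tag.sub('', page)

print(vowels)
print(text)
```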
<p>In order to use regular expressions in Python we need to look at another concept present in the language: importing modules. The Python core provides most of the usual operations, and the language also comes with a library of built-in modules that provide "access to operations that are not part of the core of the language but are nevertheless built in". One of these operations is the ability to interpret regular expressions, which in Python lives in the <code>re</code> module. Apart from the language core and the built-in modules, Python can be further extended with third-party modules imported into your programs. Anyone can create a module and distribute it to other Python users and programmers.</p>
<p>So, in order to have regex capabilities we literally have to <code>import</code> the regex module. We do that by entering the line:</p>
<pre name="code" class="python">
import re
</pre>
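<p>As a minimal sketch of the vowel example mentioned earlier, <code>re.findall</code> can collect every vowel in a string. The phrase and the variable names are made up for illustration, and <code>print</code> is written with parentheses so the snippet also runs on Python 3:</p>

```python
import re

# Hypothetical phrase, chosen only for illustration.
phrase = "Beginning Python for Bioinformatics"

# [aeiou] matches any single vowel; re.IGNORECASE also catches capital letters.
vowels = re.findall("[aeiou]", phrase, re.IGNORECASE)

print(vowels)
print(len(vowels))
```

<p>Each element of the returned list is one matched character, so the length of the list is the number of vowels in the phrase.</p>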
<p>Python's code style guide suggests that <code>import</code> statements should be on separate lines</p>
<pre name="code" class="python">
import sys
import re
 ...
</pre>
<p>and placed at the top of the file (usually right below the line that tells your OS to run the script with the Python interpreter).</p>
<p>So the first two lines of our new script would be</p>
<pre name="code" class="python">
#! /usr/bin/env python
import re
</pre>
<p>Now that we have the ability to use regex, we need to create one expression that will transcribe our DNA sequences into RNA.</p>
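<p>One possible shape for such an expression, offered only as a sketch and not necessarily the form this series settles on, is to let <code>re.sub</code> replace every T with a U. The variable names <code>myDNA</code> and <code>myRNA</code> are assumptions made for this example:</p>

```python
import re

myDNA = "ACGTACGTACGTACGTACGTACGT"

# Transcription here simply means replacing every thymine (T) with uracil (U).
# re.sub(pattern, replacement, string) returns a new string with the changes.
myRNA = re.sub("T", "U", myDNA)

print(myRNA)
```

<p>Because strings cannot be changed in place, <code>re.sub</code> returns a new string and leaves <code>myDNA</code> as it was.</p>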
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/01/15/importing-and-regular-expressions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sequences and Strings &#8211; part II</title>
		<link>http://python.genedrift.org/2006/12/14/sequences-and-strings-part-ii/</link>
		<comments>http://python.genedrift.org/2006/12/14/sequences-and-strings-part-ii/#comments</comments>
		<pubDate>Thu, 14 Dec 2006 20:30:32 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sequence]]></category>
		<category><![CDATA[strings]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2006/12/14/sequences-and-strings-part-ii/</guid>
		<description><![CDATA[Another important task for many biologists is to merge/concatenate different strings of DNA into a single sequence. We can modify the previous script to concatenate two distinct DNA sequences into one. We start from the structure of code_01 and add one extra element (line 3): #! /usr/bin/env python myDNA = &#34;ACGTACGTACGTACGTACGTACGT&#34; myDNA2 = &#34;TCGATCGATCGATCGATCGA&#34; print myDNA, myDNA2 So [...]]]></description>
			<content:encoded><![CDATA[<p>Another important task for many biologists is to merge/concatenate different strings of DNA into a single sequence. We can modify the previous script to concatenate two distinct DNA sequences into one.</p>
<p>We start from the structure of code_01 and add one extra element (line 3):</p>
<pre name="code" class="python">
#! /usr/bin/env python
myDNA = &quot;ACGTACGTACGTACGTACGTACGT&quot;
myDNA2 = &quot;TCGATCGATCGATCGATCGA&quot;
print myDNA, myDNA2
</pre>
<p>So far, we have added a new string containing an extra DNA sequence and we <code>print</code> both sequences. In Python the <code>print</code> statement automatically adds a new line at the end of whatever is printed, unless you put a comma (,) at the end of the statement. A comma is also needed to separate the items when you print more than one string, and Python puts a space between them in the output (<em>try removing the comma from the code above</em>).</p>
<p>Now, how do we merge myDNA and myDNA2? Easy in Python: just <em>sum</em> them with a plus sign:</p>
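<p>For readers on a newer interpreter: the scripts on this page use the <code>print</code> statement, while Python 3 provides <code>print()</code> as a function. The following is only a sketch of how the same spacing and newline behaviour looks with the function form. The short sequences and the <code>buf</code> buffer (used here just to capture the output so it can be inspected) are assumptions made for the example:</p>

```python
import io

myDNA = "ACGT"
myDNA2 = "TCGA"

buf = io.StringIO()

# Two items: print() separates them with a space and ends with a newline.
print(myDNA, myDNA2, file=buf)

# end=" " plays the role of the trailing comma: no newline after this item.
print(myDNA, end=" ", file=buf)
print(myDNA2, file=buf)

output = buf.getvalue()
print(output, end="")
```

<p>Both pairs of sequences end up on a single line each, separated by one space.</p>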
<pre name="code" class="python">
myDNA3 = myDNA + myDNA2
print myDNA3
</pre>
<p>Notice that in Python strings are immutable, meaning they cannot be changed once created. This immutability has some advantages: a string cannot be modified by accident elsewhere in the program (<em>in Python a variable is just a name bound to a string; the string itself is not a changeable <strong>variable</strong></em>), and it also allows some performance gains in the interpreter. So, in order to have our sequences merged, we create a third sequence that receives both strings. Finally our code will be (some captions were added):</p>
<pre name="code" class="python">
#! /usr/bin/env python
myDNA = &quot;ACGTACGTACGTACGTACGTACGT&quot;
myDNA2 = &quot;TCGATCGATCGATCGATCGA&quot;
print &quot;First and Second sequences&quot;
print myDNA, myDNA2
myDNA3 = myDNA + myDNA2
print &quot;Concatenated sequence&quot;
print myDNA3
</pre>
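<p>As a side note, the immutability mentioned above can be checked with a small experiment. This is only a sketch and is not part of the script above: the shorter sequence and the names <code>changed_in_place</code> and <code>longer</code> are made up for illustration, and <code>print</code> is written with parentheses so the snippet also runs on Python 3:</p>

```python
myDNA = "ACGTACGT"

# Trying to change one character in place fails, because strings are immutable.
try:
    myDNA[0] = "T"
    changed_in_place = True
except TypeError:
    changed_in_place = False

# Concatenation builds a brand new string; the original is left untouched.
longer = myDNA + "TTTT"

print(changed_in_place)
print(myDNA)
print(longer)
```

<p>The assignment raises a <code>TypeError</code>, while the <code>+</code> operator works because it creates a new string instead of modifying an existing one.</p>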
<p>Easy, eh? Of course these two simple scripts do not even scratch the surface of Python programming, but they are a start.</p>
<p>The above code can be downloaded from the <a href="http://python.genedrift.org/repository/">repository</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2006/12/14/sequences-and-strings-part-ii/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Hands on code: Sequences and strings &#8211; part I</title>
		<link>http://python.genedrift.org/2006/12/13/hands-on-code-sequences-and-strings-part-i/</link>
		<comments>http://python.genedrift.org/2006/12/13/hands-on-code-sequences-and-strings-part-i/#comments</comments>
		<pubDate>Thu, 14 Dec 2006 03:27:48 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[sequence]]></category>
		<category><![CDATA[strings]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2006/12/13/hands-on-code-sequences-and-strings-part-i/</guid>
		<description><![CDATA[As pointed out in Beginning Perl for Bioinformatics, a large percentage of bioinformatics procedures deal with strings, especially DNA and amino acid sequence data. As is widely known, DNA is composed of four different nucleotides (A, C, T and G) and proteins can contain up to 20 different amino acids. Each one of these elements has one [...]]]></description>
			<content:encoded><![CDATA[<p>As pointed out in <strong>Beginning Perl for Bioinformatics</strong>, a large percentage of bioinformatics procedures deal with strings, especially DNA and amino acid sequence data. As is widely known, DNA is composed of four different nucleotides (A, C, T and G) and proteins can contain up to 20 different amino acids. Each one of these elements has one letter of the alphabet assigned to it. In the DNA case, some additional letters stand for more than one nucleotide that may be present at a given sequence position (click <a href="http://www.cns.fr/externe/English/Projets/Resultats/iupaciub.html">here</a> for more).</p>
<p>So, just as the amino acid is the basic building block of proteins (AKA polypeptides), strings containing sequences are our most basic building block, the one on which all the bioinformatics magic will work.</p>
<p>In Perl a scalar variable, which is what usually holds a string, is written with a dollar sign in front of its name, like this: <code>$sequence</code>. Python is dynamically typed, meaning variable types are worked out by the interpreter at run time. In practice, the value after the equals sign tells the interpreter what type of object the variable holds. So in Python, if you want to store a DNA sequence, you can just enter</p>
<pre name="code" class="python">
mydna=&quot;ACGTACGTACGTACGTACGTACGT&quot;
</pre>
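<p>To see dynamic typing at work, the built-in <code>type()</code> function reports what the interpreter decided. This is just a sketch: <code>sequence_length</code> is a name made up for this example, and <code>print</code> is written with parentheses so the snippet also runs on Python 3:</p>

```python
mydna = "ACGTACGTACGTACGTACGTACGT"

# The quotes on the right-hand side make this a string (str).
print(type(mydna).__name__)

# len() returns a whole number, so this name is bound to an integer (int).
sequence_length = len(mydna)
print(type(sequence_length).__name__)
print(sequence_length)
```

<p>No type was declared anywhere; the interpreter worked it out from the values being assigned.</p>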
<p><em>A quick note: Python can be used either from the interactive interpreter prompt or by running previously saved scripts. I will try to use the latter in the code examples.</em></p>
<p>OK, we are ready to create our first Bioinformatics Python Hello World script. Let&#8217;s take the sequence above and print it on the screen. The first line tells the operating system what to use to run the script and where to find the Python interpreter</p>
<pre name="code" class="python">
#! /usr/bin/env python
</pre>
<p>Next we will create the variable myDNA and assign the corresponding sequence</p>
<pre name="code" class="python">
myDNA = &quot;ACGTACGTACGTACGTACGTACGT&quot;
</pre>
<p>And finally, we will print the contents of the variable to the screen:</p>
<pre name="code" class="python">
print myDNA
</pre>
<p>As mentioned before, Python requires indentation for blocks of code such as loops, if clauses and functions. Our final script contains none of these, so no indentation is needed:</p>
<pre name="code" class="python">
#! /usr/bin/env python
myDNA = &quot;ACGTACGTACGTACGTACGTACGT&quot;
print myDNA
</pre>
<p>The first line tells your operating system that this is a Python script and that it should be run with the <code>python</code> interpreter that <code>env</code> finds on your path; line two declares a variable called <code>myDNA</code> and assigns the sequence string to it; and the last line simply outputs this variable to the screen. That simple!</p>
<p>To run this (extremely simple) script, copy and paste the code above into your favourite text editor and save the file with a <code>.py</code> extension (recommended but not necessary). Then, as long as you have Python installed, just open a shell and type on the command line:</p>
<pre name="code" class="python">
&gt; python code_01.py
</pre>
<p>To download the above script check the <a href="http://python.genedrift.org/repository/">repository</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2006/12/13/hands-on-code-sequences-and-strings-part-i/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Why Python and not Perl?</title>
		<link>http://python.genedrift.org/2006/12/13/4/</link>
		<comments>http://python.genedrift.org/2006/12/13/4/#comments</comments>
		<pubDate>Thu, 14 Dec 2006 03:00:09 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/?p=4</guid>
		<description><![CDATA[According to the official Python website: Python and Perl come from a similar background (Unix scripting, which both have long outgrown) [to learn more about that check this tutorial], and sport many similar features, but have a different philosophy. Perl emphasizes support for common application-oriented tasks, e.g. by having built-in regular expressions, file scanning and [...]]]></description>
			<content:encoded><![CDATA[<p>According to the official Python <a href="http://www.python.org/doc/essays/comparisons.html">website</a>:</p>
<blockquote><p>Python and Perl come from a similar background (Unix scripting, which both have long outgrown) [to learn more about that check this tutorial], and sport many similar features, but have a different philosophy. Perl emphasizes support for common application-oriented tasks, e.g. by having built-in regular expressions, file scanning and report generating features. Python emphasizes support for common programming methodologies such as data structure design and object-oriented programming, and encourages programmers to write readable (and thus maintainable) code by providing an elegant but not overly cryptic notation. As a consequence, Python comes close to Perl but rarely beats it in its original application domain; however Python has an applicability well beyond Perl&#8217;s niche.</p></blockquote>
<p>I couldn&#8217;t explain it better than that. But I still have to give my take on why I prefer Python over Perl, and why I decided to use it in my day-to-day programming. First, I have to admit that I am a lousy Perl programmer (not even close to an apprentice monger) and I always get confused by its syntax. Second, I come from a Basic/Pascal/C++ background, all of them having slightly better syntaxes than Perl. Thus, it was natural to get on the Python bandwagon and, as the paragraph above states, Python code is &#8220;extremely&#8221; readable (emphasis is mine); in no time you can grasp it completely. OK, I admit that it has at least one odd feature: the &#8220;mandatory&#8221; indentation. In Python you have to indent (using tabs or spaces; spaces are recommended) the bodies of loops, if clauses, functions, everything that forms a block. Maybe this is the first and only hard step to get used to, but after a couple of hours of coding you will be pleased with how good your code looks.</p>
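<p>To give an idea of what that indentation looks like, here is a small sketch with a function, a loop and an if clause, each one indented by four spaces more than the line that opens it. The function <code>count_nucleotide</code> and the sequence are invented for this illustration, and <code>print</code> is written with parentheses so the snippet also runs on Python 3:</p>

```python
def count_nucleotide(sequence, nucleotide):
    # Everything indented under "def" belongs to the function.
    count = 0
    for base in sequence:
        # Everything indented under "for" is repeated for each base.
        if base == nucleotide:
            # This line runs only when the condition is true.
            count += 1
    return count


myDNA = "ACGTACGTACGTACGTACGTACGT"
print(count_nucleotide(myDNA, "A"))
```

<p>There are no braces or <code>end</code> keywords: the indentation alone tells Python where each block starts and stops.</p>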
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2006/12/13/4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Beginning the begin</title>
		<link>http://python.genedrift.org/2006/12/13/3/</link>
		<comments>http://python.genedrift.org/2006/12/13/3/#comments</comments>
		<pubDate>Thu, 14 Dec 2006 02:36:29 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 1]]></category>
		<category><![CDATA[begin]]></category>
		<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[initial post]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/?p=3</guid>
		<description><![CDATA[This website uses as a premise the book: Beginning Perl for Bioinformatics by James Tisdall (click here to buy it), which was published in 2001. My idea here is to follow the structure of the book, analyzing each chapter and converting the Perl scripts into Python. The original book is very well written and an [...]]]></description>
			<content:encoded><![CDATA[<p>This website uses as a premise the book:</p>
<p>Beginning Perl for Bioinformatics by James Tisdall (click <a href="http://www.amazon.com/Beginning-Perl-Bioinformatics-James-Tisdall/dp/0596000804/sr=8-1/qid=1166046346/ref=pd_bbs_sr_1/002-4679536-1437626?ie=UTF8&amp;s=books">here</a> to buy it).</p>
<p>which was published in 2001. My idea here is to follow the structure of the book, analyzing each chapter and converting the Perl scripts into Python. The original book is very well written and an excellent starting point for any aspiring bioinformatician, whether you are a biologist who does not understand programming or a computer scientist who does not know a lot of biology, or perhaps even Perl.</p>
<p>In no way does this website/tutorial try to plagiarize the book, and I will try to include a minimum amount of Perl code, as the book is only used as a starting point (a very good one indeed) for this journey into Python. Here you will find neither explanations of biological concepts nor criticisms of Perl. Having made this clear, I will start from the beginning.</p>
<p><strong>Why Python (and not Perl)?</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2006/12/13/3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

