<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Beginning Python for Bioinformatics &#187; Section 7</title>
	<atom:link href="http://python.genedrift.org/category/section-7/feed/" rel="self" type="application/rss+xml" />
	<link>http://python.genedrift.org</link>
	<description>a step-by-step guide to create Python applications in bioinformatics</description>
	<lastBuildDate>Wed, 10 Mar 2010 13:03:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=3.0-alpha</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>GenBank: parsing some features (and a change for the future)</title>
		<link>http://python.genedrift.org/2007/10/03/genbank-parsing-some-features-and-a-change-for-the-future/</link>
		<comments>http://python.genedrift.org/2007/10/03/genbank-parsing-some-features-and-a-change-for-the-future/#comments</comments>
		<pubDate>Wed, 03 Oct 2007 21:25:16 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 7]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/10/03/genbank-parsing-some-features-and-a-change-for-the-future/</guid>
		<description><![CDATA[This is the last entry based on the book. In my opinion, further topics in the book are a little bit redundant and can be accomplished quite easily if you have followed the tutorial here. If a good number of people have interest in checking the remainder of the book, just let me know and [...]]]></description>
			<content:encoded><![CDATA[<p>This is the last entry based on the book. In my opinion, the remaining topics in the book are somewhat redundant and can be accomplished quite easily if you have followed the tutorial here. If enough people are interested in the remainder of the book, just let me know and I will go back and follow it. In the meantime, I am accepting suggestions for topics to cover (send me an email or leave a comment). I already have some in mind and I am preparing a couple for the next phase of the website. So, here is the last entry.</p>
<p>Last time we saw how to extract the sequence from a GenBank file. This time we are going to parse some other information from these files. We will use the same idea as in our last post to extract the organism name, the locus and the accession number of the record. From our last entry we have to remember this</p>
<pre name="code" class="python">
sequence = &#039;&#039;
issequence = False
for line in gbfile:
    if issequence == True and not line.find(&#039;/&#039;) == 0:
        sequence += line.lstrip(&#039;0123456789 &#039;).replace(&#039; &#039;, &#039;&#039;)
    elif line.find(&#039;ORIGIN&#039;) &gt;= 0:
        issequence = True
</pre>
<p>and modify it to our needs. Looks simple, and it is. Let&#8217;s see</p>
<pre name="code" class="python">
import sys

# Read all lines of the GenBank file given on the command line
gbfile = open(sys.argv[1], &#039;r&#039;).readlines()

locus = &#039;&#039;
organism = &#039;&#039;
accession = &#039;&#039;
for line in gbfile:
    # Keep the whole line that contains each keyword
    if line.find(&#039;LOCUS&#039;) &gt;= 0:
        locus = line
    elif line.find(&#039;ACCESSION&#039;) &gt;= 0:
        accession = line
    elif line.find(&#039;ORGANISM&#039;) &gt;= 0:
        organism = line

# strip() removes the leading spaces and the trailing newline
print(locus.strip())
print(organism.strip())
print(accession.strip())
</pre>
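<p>Just add a condition for each entry you want to parse and that&#8217;s it. Each variable keeps the whole matching line, keyword included. For longer entries that span several lines, such as the sequence, we have to use the same approach as before: a boolean flag that is switched on at the keyword, and concatenation of the lines until the next keyword is found.</p>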
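<p>As a rough sketch of that approach for a multi-line entry, the snippet below collects the DEFINITION field, which spans three lines in our file. The function name <code>read_definition</code> and the rule used to detect continuation lines (they begin with a space) are assumptions made for this illustration, not part of the scripts above.</p>
<pre name="code" class="python">
def read_definition(lines):
    # Collect the DEFINITION entry, which may continue over several lines.
    parts = []
    indefinition = False
    for line in lines:
        if line.startswith('DEFINITION'):
            indefinition = True
            parts.append(line[len('DEFINITION'):].strip())
        elif indefinition and line.startswith(' '):
            # Continuation lines are indented (assumption for this sketch).
            parts.append(line.strip())
        elif indefinition:
            # A non-indented line starts the next entry, so we stop here.
            break
    return ' '.join(parts)


record = [
    'LOCUS       DQ283072                2414 bp    DNA     linear   VRT 23-MAR-2006\n',
    'DEFINITION  Megaelosia goeldii 12S ribosomal RNA gene, partial sequence;\n',
    '            tRNA-Val gene, complete sequence; and 16S ribosomal RNA gene,\n',
    '            partial sequence; mitochondrial.\n',
    'ACCESSION   DQ283072\n',
]
print(read_definition(record))
</pre>
<p>The flag plays the same role as <code>issequence</code> did for the sequence: it is switched on at the keyword and the loop stops collecting as soon as the next entry begins.</p>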
<p>&#8212;&#8212;&#8212;&#8212;-</p>
<p>Well, that&#8217;s it. After 46 entries we start a new phase. It will still be beginning Python for bioinformatics.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/10/03/genbank-parsing-some-features-and-a-change-for-the-future/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>GenBank files: take two</title>
		<link>http://python.genedrift.org/2007/09/28/genbank-files-take-two/</link>
		<comments>http://python.genedrift.org/2007/09/28/genbank-files-take-two/#comments</comments>
		<pubDate>Fri, 28 Sep 2007 14:50:27 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 7]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/09/28/genbank-files-take-two/</guid>
		<description><![CDATA[Last time we saw how to extract the sequence from our GenBank file, but the final result might not be as nice as we wanted it to be. So, we need to make it prettier. Running last entry&#8217;s script we end up with this sequence
        1 gtttggtcct aaccttgtaa [...]]]></description>
			<content:encoded><![CDATA[<p>Last time we saw how to extract the sequence from our GenBank file, but the final result might not be as nice as we wanted it to be. So, we need to make it prettier. Running last entry&#8217;s script we end up with this sequence</p>
<p>        1 gtttggtcct aaccttgtaa tcaattttta cttaatatac acatgcaagt ctccgcaccc<br />
       61 ctgtgaaaac gcccttaaat cccccatggg ataaggagct ggtatcaggc acgaaaatct<br />
      121 gcccaaaaca cctagctatg ccacacccac aagggtactc agcagtgatt gacattattt<br />
      181 ataagcgcca gcttgactca gttaaagtaa agagagccgg caaatctggt gccagccgcc<br />
      241 gcggttaccc cacgtggctc aaattgattt ctttcggcgt aaagcgtgat taaagtgccc<br />
      301 atcaacattg gagttaaact aaaattaagc tgtgacacgc ttattttaca gaaaagcaca<br />
      361 aacgaaagtt acttcaattt aaacaacttg aattcacgac agtcaggaca caaactggga<br />
      421 ttagataccc cactatgccc gaccgtaaac tttaatttac accatcaccg ccagagaact<br />
      481 acgagcaaag cttaaaactc aaaggacttg acggtccccc acatccccct agaggagcct<br />
      541 gtcctttaat cgataatccc cgcttaacct caccattctt agtctttcag cctgtatacc<br />
      601 tccgtcgtca gcttaccccg tgagcgaaaa ttagtgagct taatgtccac acgtctacac<br />
      661 gtcaggtcaa ggtgcagcaa atataatggg aagagatggg ctacactttc tagtctagaa<br />
      721 tatacgaaag accacctatg aaacctggtc agaaggcgga tttagaagta aaaggaaacc<br />
      781 agagcatccc ttttaatttg gcactggggc atgtacacac cgcccgtcac cctcttcaaa<br />
      841 gcctaatttt agtatctaac caactaacgc ctagtagaag aggcaagtcg taacatggta<br />
      901 agtataccgg aaggtgtgct tggaaacaaa atatagccta atcaaagcat ttcgcttaca<br />
      961 ccgaaaagtt atctgtgaaa ttcagattat tttgagctaa aaatctagcc ccactttatt<br />
     1021 ctataatccc ttatcactta aattcatgaa tcaaaacatt ttaataatca agtaaaggcg<br />
     1081 attgaaaaat taataggagc aatatatact gtaccgcaag ggaaagatga aatagaaatg<br />
     1141 aaataataat taaagcataa aaaagtaaag attaaatctt gtaccttttg catcatgatt<br />
     1201 taactagtct acccaggcaa aatgatttta agtctgacct cccgaaacta agtgagctac<br />
     1261 ttcaaggcag cttaatgagc aaatccgtct ctgtcgcaaa agagtggaga gaccttcaag<br />
     1321 tagaagtgat agacctaacg aacttagtaa tagctggtta ttcaagaaaa ggatctcagt<br />
     1381 ccaacctaaa gtcaaattaa tgtttaaaaa taaaaattct gaccttagag taattcaatt<br />
     1441 aaggtacagc ctacttgaaa caggatacaa ccttaactaa tgggtaactt accccttcat<br />
     1501 cttttaagtg ggcctaaaag cagccacctt taaaatagcg tcaaagctta gccgtcctat<br />
     1561 acatctaata ccaaaaacat ctatgaaccc tatactcata ttgaataatt ctatattatt<br />
     1621 atagagattt ttatgttaaa actagtaaca agaattaaat tttctctatt atgttcgtgt<br />
     1681 acatcagaaa ggataaacca ctgataattg acatgcatga gtaaaaagca gtaacttaac<br />
     1741 aagaaaaccc tcctaactct aatgttaacc taacacaagt acatctcaag aaagatttaa<br />
     1801 agaaaaagaa ggaactcggc aaacattaac ctcgcctgtt taccaaaaac atcgcctctt<br />
     1861 gtcaaaattt aagaggtcca gcctgcccag tgaccctgtt caacggccgc ggtatcctaa<br />
     1921 ccgtgcgaag gtagcgtaat cacttgttct ttaaataagg actagtatga atggcaccac<br />
     1981 gagggttata ctgtctcctt tttctaatca gtgaaactaa tcttcccgtg aagaagcggg<br />
     2041 aatttttata taagacgaga agaccctatg gagctttaga cgagtaacaa ctgctaattt<br />
     2101 tataatattt cagataatat ctctatccta gcattatgat tataagtctt tggttggggt<br />
     2161 gaccgcggag aaaaaaataa cctccacatt gaaagaatat tattctaagc aaaaagacac<br />
     2221 atctttaagc atcaacaaat tgacatctat tgacccaata ttttgatcaa cgaaccaagt<br />
     2281 taccctaggg ataacagcgc aatccacttc gagagctctt atcgacaagt gggcttacga<br />
     2341 cctcgatgtt ggatcagggt atcctagtgg tgtagccgct actaaaggtt cgtttgttca<br />
     2401 acgattaaaa ccct<br />
//</p>
<p>Notice that we have not yet mentioned the <code>//</code> at the end of the file. This double slash marks the end of a GenBank record. Ideally we should remove it from the final output, together with the nucleotide position numbers at the beginning of each line. In order to do that we need to modify our last script.</p>
<p>First we will add a condition inside the <code>for</code> loop that takes care of the double slash at the end. Then we need a trick to remove the numbers from the lines. We expect the numbers to be at the beginning of the line, and never anywhere else. The long way would be to create a regular expression and replace every number with an empty string. But Python comes with batteries included, so we will take the short path. We use the <code>lstrip</code> string method, which removes characters from the left end of the string, and tell it to remove the digits 0 to 9 and the extra spaces before the nucleotides. Our script will look like this:</p>
<pre name="code" class="python">
import sys

gbfile = open(sys.argv[1], &#039;r&#039;).readlines()

sequence = &#039;&#039;
issequence = False
for line in gbfile:
    # After ORIGIN, keep every line except the closing //
    if issequence == True and not line.find(&#039;/&#039;) == 0:
        # Drop the leading position number and spaces
        sequence += line.lstrip(&#039;0123456789 &#039;)
    elif line.find(&#039;ORIGIN&#039;) &gt;= 0:
        issequence = True

print(sequence)
</pre>
<p>Notice that the first <code>if</code> condition now also requires that the line does <b>not</b> start with a slash. On the line where we concatenate the sequence we added <code>lstrip('0123456789 ')</code> to modify the string on the fly. The final result looks much better now. But we still have a small &#8220;problem&#8221;: there are spaces between groups of 10 nucleotides and we want to get rid of them. We would be surprised if that were difficult, and as expected it is not. We used the <code>replace</code> method <a href="http://python.genedrift.org/2007/09/07/restrinction-enzymes-second-take/">here</a> and we can use it again. We only need to modify the line that concatenates the sequence, and our final script will be</p>
<pre name="code" class="python">
import sys

gbfile = open(sys.argv[1], &#039;r&#039;).readlines()

sequence = &#039;&#039;
issequence = False
for line in gbfile:
    # After ORIGIN, keep every line except the closing //
    if issequence == True and not line.find(&#039;/&#039;) == 0:
        # Drop the leading number, then the spaces between groups
        sequence += line.lstrip(&#039;0123456789 &#039;).replace(&#039; &#039;, &#039;&#039;)
    elif line.find(&#039;ORIGIN&#039;) &gt;= 0:
        issequence = True

print(sequence)
</pre>
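<p>Notice that we call the <code>replace</code> method after <code>lstrip</code>, but the order does not matter here. The output should contain only the nucleotides, with no spaces or numbers; each line still ends with a newline character, so the line breaks of the file are kept.</p>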
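<p>To see what each step does on a single line, and how the regular expression route mentioned earlier would compare, here is a small sketch. The variable names and the pattern are illustrative choices; the sample line is the last sequence line of our file.</p>
<pre name="code" class="python">
import re

line = '     2401 acgattaaaa ccct\n'

# Short path: drop leading digits and spaces, then the inner spaces.
short = line.lstrip('0123456789 ').replace(' ', '')

# Long way: one regular expression that removes every digit and
# every whitespace character, including the trailing newline.
long_way = re.sub(r'[\d\s]', '', line)

print(repr(short))
print(repr(long_way))
</pre>
<p>Keep in mind that <code>lstrip</code> takes a set of characters, not a prefix: it keeps removing from the left while the next character belongs to the set, and stops at the first nucleotide. The regular expression version also removes the newline, which the short path leaves in place.</p>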
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/09/28/genbank-files-take-two/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>GenBank files: take one</title>
		<link>http://python.genedrift.org/2007/09/27/genbank-files/</link>
		<comments>http://python.genedrift.org/2007/09/27/genbank-files/#comments</comments>
		<pubDate>Thu, 27 Sep 2007 21:13:14 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 7]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/09/27/genbank-files/</guid>
		<description><![CDATA[We are back, moving to a new chapter of the book and starting a new section on the site. This chapter deals with the manipulation of GenBank files. These files are used by NCBI to store information about RNA, DNA and protein sequences. It is usually composed of an annotation section, that gives information about [...]]]></description>
			<content:encoded><![CDATA[<p>We are back, moving to a new chapter of the book and starting a new section on the site. This chapter deals with the manipulation of <a href="http://www.ncbi.nlm.nih.gov">GenBank</a> files. These files are used by NCBI to store information about RNA, DNA and protein sequences. A file is usually composed of an annotation section, which gives information about the sequence it holds, followed by the sequence itself. I won&#8217;t spend much time explaining the GenBank format, because that is not the goal of the site. The Perl book has a good explanation of it and you can also find more information <a href="http://www.umanitoba.ca/afs/plant_science/psgendb/doc/GenBank/gbrel.txt">here</a>. We are also going to see some of the characteristics of such files here.</p>
<p>The GenBank file we are going to manipulate from now on is this one</p>
<blockquote><p>LOCUS       DQ283072                2414 bp    DNA     linear   VRT 23-MAR-2006<br />
DEFINITION  Megaelosia goeldii 12S ribosomal RNA gene, partial sequence;<br />
            tRNA-Val gene, complete sequence; and 16S ribosomal RNA gene,<br />
            partial sequence; mitochondrial.<br />
ACCESSION   DQ283072<br />
VERSION     DQ283072.1  GI:90296241<br />
KEYWORDS    .<br />
SOURCE      mitochondrion Megaelosia goeldii (Rio big-tooth frog)<br />
  ORGANISM  Megaelosia goeldii<br />
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;<br />
            Amphibia; Batrachia; Anura; Neobatrachia; Hyloidea;<br />
            Leptodactylidae; Cycloramphinae; Megaelosia.<br />
REFERENCE   1  (bases 1 to 2414)<br />
  AUTHORS   Frost,D.R., Grant,T., Faivovich,J., Bain,R., Haas,A.,<br />
            Haddad,C.F.B., de Sa,R.O., Channing,A., Wilkinson,M.,<br />
            Donnellan,S.C., Raxworthy,C., Campbell,J.A., Blotto,B.L., Moler,P.,<br />
            Drewes,R.C., Nussbaum,R.A., Lynch,J.D., Green,D.M. and Wheeler,W.C.<br />
  TITLE     The Amphibian Tree of Life<br />
  JOURNAL   Bull. Am. Mus. Nat. Hist. 297, 1-291 (2006)<br />
REFERENCE   2  (bases 1 to 2414)<br />
  AUTHORS   Frost,D.R., Grant,T., Faivovich,J., Bain,R., Haas,A.,<br />
            Haddad,C.F.B., de Sa,R.O., Channing,A., Wilkinson,M.,<br />
            Donnellan,S.C., Raxworthy,C., Campbell,J.A., Blotto,B.L., Moler,P.,<br />
            Drewes,R.C., Nussbaum,R.A., Lynch,J.D., Green,D.M. and Wheeler,W.C.<br />
  TITLE     Direct Submission<br />
  JOURNAL   Submitted (26-OCT-2005) Herpetology, Division of Vertebrate<br />
            Zoology, American Museum of Natural History, Central Park West at<br />
            79th Street, New York, NY 10024, USA<br />
FEATURES             Location/Qualifiers<br />
     source          1..2414<br />
                     /organism=&quot;Megaelosia goeldii&quot;<br />
                     /organelle=&quot;mitochondrion&quot;<br />
                     /mol_type=&quot;genomic DNA&quot;<br />
                     /specimen_voucher=&quot;Paulo Nuin&quot;<br />
                     /db_xref=&quot;taxon:209670&quot;<br />
                     /country=&quot;Brazil: Rio de Janeiro, Teresopolis, Rio Beija<br />
                     Flor, 910 m, 22&#8242;24&#8242;S, 42&#8242;69&#8242;W&quot;<br />
     misc_RNA        <1..>2414<br />
                     /note=&quot;contains 12S ribosomal RNA, tRNA-Val, and 16S<br />
                     ribosomal RNA&quot;<br />
ORIGIN<br />
        1 gtttggtcct aaccttgtaa tcaattttta cttaatatac acatgcaagt ctccgcaccc<br />
       61 ctgtgaaaac gcccttaaat cccccatggg ataaggagct ggtatcaggc acgaaaatct<br />
      121 gcccaaaaca cctagctatg ccacacccac aagggtactc agcagtgatt gacattattt<br />
      181 ataagcgcca gcttgactca gttaaagtaa agagagccgg caaatctggt gccagccgcc<br />
      241 gcggttaccc cacgtggctc aaattgattt ctttcggcgt aaagcgtgat taaagtgccc<br />
      301 atcaacattg gagttaaact aaaattaagc tgtgacacgc ttattttaca gaaaagcaca<br />
      361 aacgaaagtt acttcaattt aaacaacttg aattcacgac agtcaggaca caaactggga<br />
      421 ttagataccc cactatgccc gaccgtaaac tttaatttac accatcaccg ccagagaact<br />
      481 acgagcaaag cttaaaactc aaaggacttg acggtccccc acatccccct agaggagcct<br />
      541 gtcctttaat cgataatccc cgcttaacct caccattctt agtctttcag cctgtatacc<br />
      601 tccgtcgtca gcttaccccg tgagcgaaaa ttagtgagct taatgtccac acgtctacac<br />
      661 gtcaggtcaa ggtgcagcaa atataatggg aagagatggg ctacactttc tagtctagaa<br />
      721 tatacgaaag accacctatg aaacctggtc agaaggcgga tttagaagta aaaggaaacc<br />
      781 agagcatccc ttttaatttg gcactggggc atgtacacac cgcccgtcac cctcttcaaa<br />
      841 gcctaatttt agtatctaac caactaacgc ctagtagaag aggcaagtcg taacatggta<br />
      901 agtataccgg aaggtgtgct tggaaacaaa atatagccta atcaaagcat ttcgcttaca<br />
      961 ccgaaaagtt atctgtgaaa ttcagattat tttgagctaa aaatctagcc ccactttatt<br />
     1021 ctataatccc ttatcactta aattcatgaa tcaaaacatt ttaataatca agtaaaggcg<br />
     1081 attgaaaaat taataggagc aatatatact gtaccgcaag ggaaagatga aatagaaatg<br />
     1141 aaataataat taaagcataa aaaagtaaag attaaatctt gtaccttttg catcatgatt<br />
     1201 taactagtct acccaggcaa aatgatttta agtctgacct cccgaaacta agtgagctac<br />
     1261 ttcaaggcag cttaatgagc aaatccgtct ctgtcgcaaa agagtggaga gaccttcaag<br />
     1321 tagaagtgat agacctaacg aacttagtaa tagctggtta ttcaagaaaa ggatctcagt<br />
     1381 ccaacctaaa gtcaaattaa tgtttaaaaa taaaaattct gaccttagag taattcaatt<br />
     1441 aaggtacagc ctacttgaaa caggatacaa ccttaactaa tgggtaactt accccttcat<br />
     1501 cttttaagtg ggcctaaaag cagccacctt taaaatagcg tcaaagctta gccgtcctat<br />
     1561 acatctaata ccaaaaacat ctatgaaccc tatactcata ttgaataatt ctatattatt<br />
     1621 atagagattt ttatgttaaa actagtaaca agaattaaat tttctctatt atgttcgtgt<br />
     1681 acatcagaaa ggataaacca ctgataattg acatgcatga gtaaaaagca gtaacttaac<br />
     1741 aagaaaaccc tcctaactct aatgttaacc taacacaagt acatctcaag aaagatttaa<br />
     1801 agaaaaagaa ggaactcggc aaacattaac ctcgcctgtt taccaaaaac atcgcctctt<br />
     1861 gtcaaaattt aagaggtcca gcctgcccag tgaccctgtt caacggccgc ggtatcctaa<br />
     1921 ccgtgcgaag gtagcgtaat cacttgttct ttaaataagg actagtatga atggcaccac<br />
     1981 gagggttata ctgtctcctt tttctaatca gtgaaactaa tcttcccgtg aagaagcggg<br />
     2041 aatttttata taagacgaga agaccctatg gagctttaga cgagtaacaa ctgctaattt<br />
     2101 tataatattt cagataatat ctctatccta gcattatgat tataagtctt tggttggggt<br />
     2161 gaccgcggag aaaaaaataa cctccacatt gaaagaatat tattctaagc aaaaagacac<br />
     2221 atctttaagc atcaacaaat tgacatctat tgacccaata ttttgatcaa cgaaccaagt<br />
     2281 taccctaggg ataacagcgc aatccacttc gagagctctt atcgacaagt gggcttacga<br />
     2341 cctcgatgtt ggatcagggt atcctagtgg tgtagccgct actaaaggtt cgtttgttca<br />
     2401 acgattaaaa ccct<br />
//</p></blockquote>
<p>which stores a sequence of a mitochondrial gene of a stream frog from South America, called <em>Megaelosia goeldii</em>, also known as Rio big-tooth frog. (get a better formatted file <a href="http://python.genedrift.org/megaelosia.gbk">here</a>)</p>
<p>Our first task will be to extract the DNA sequence from the file. This sounds easy, and not surprisingly it is. If we take a closer look at the file we will see that the sequence starts after the mark <strong>ORIGIN</strong>. From what we have seen before, we just need to read the file, set a boolean variable when we find <strong>ORIGIN</strong>, and concatenate everything after that. Something like this</p>
<pre name="code" class="python">
import sys

gbfile = open(sys.argv[1], &#039;r&#039;).readlines()

sequence = &#039;&#039;
issequence = False
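for line in gbfile:
    # Once ORIGIN has been seen, every following line is kept
    if issequence == True:
        sequence += line
    elif line.find(&#039;ORIGIN&#039;) &gt;= 0:
        # The ORIGIN line itself is not added, only the lines after it
        issequence = True

print(sequence)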
</pre>
<p>Quick and easy. When we find <strong>ORIGIN</strong>, <code>issequence</code> has its state changed to True and the lines below it will be concatenated into a string. We print the result at the end.</p>
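<p>If you want to try the idea without a file at hand, the same loop can be wrapped in a function and fed a list of lines. This is only a sketch: the name <code>extract_after_origin</code> and the shortened sample record are made up for the example.</p>
<pre name="code" class="python">
def extract_after_origin(lines):
    # Concatenate every line that comes after the ORIGIN mark.
    sequence = ''
    issequence = False
    for line in lines:
        if issequence:
            sequence += line
        elif line.find('ORIGIN') >= 0:
            issequence = True
    return sequence


sample = [
    'ACCESSION   DQ283072\n',
    'ORIGIN\n',
    '        1 gtttggtcct aaccttgtaa\n',
    '//\n',
]
print(extract_after_origin(sample))
</pre>
<p>The position numbers, the spaces and the closing <code>//</code> are still part of the result, exactly as in the script above.</p>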
<p>Next time we will do more fancy things.</p>
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/09/27/genbank-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

