<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Beginning Python for Bioinformatics &#187; Section 6</title>
	<atom:link href="http://python.genedrift.org/category/section-6/feed/" rel="self" type="application/rss+xml" />
	<link>http://python.genedrift.org</link>
	<description>a step-by-step guide to create Python applications in bioinformatics</description>
	<lastBuildDate>Wed, 10 Mar 2010 13:03:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=3.0-alpha</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Restriction enzymes: the grand finale</title>
		<link>http://python.genedrift.org/2007/09/13/restriction-enzymes-the-grand-finale/</link>
		<comments>http://python.genedrift.org/2007/09/13/restriction-enzymes-the-grand-finale/#comments</comments>
		<pubDate>Thu, 13 Sep 2007 21:17:28 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 6]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/09/13/restriction-enzymes-the-grand-finale/</guid>
		<description><![CDATA[We get to the last piece of the puzzle. In the last four posts we have seen each part of the restriction enzyme site searcher script, and now we put everything together. Anyone also following the book will see that the Python code we are producing here is slightly different, but in essence [...]]]></description>
			<content:encoded><![CDATA[<p>We get to the last piece of the puzzle. In the last four posts we have seen each part of the restriction enzyme site searcher script, and now we put everything together. Anyone also following the book will see that the Python code we are producing here is slightly different, but in essence achieves the same result.</p>
<p>Ok. We saw last time that the function which searches for the enzyme sites returns a tuple with the actual sites and positions. Why is it better to return sites and positions? Because one enzyme can have multiple sites and you might want to know where they are. Also, it helps us practice some Python skills.</p>
<p>We will start with the last piece, part of which we have already seen.</p>
<pre name="code" class="python">
if isname:
    print &#039;Name found&#039;
    sequences = fasta.read_fasta(open(sys.argv[2], &#039;r&#039;).readlines())
    for item in sequences:
        sites, positions = find_sites(enzyme, enzymeset, item.sequence)
        print item.name[:20]+&#039;...&#039;
        for i in zip(sites, positions):
            print i[0], &#039;-&gt;&#039;, i[1]
else:
    print &#039;Enzyme name not found, please try again&#039;
</pre>
<p>This is the <code>if</code> that checks whether the input enzyme name was found in the list. Clearly we added a couple of things. As already covered here, if <code>isname</code> is true we go and read the sequence file, which can contain a single sequence or multiple sequences. We start a loop, call <code>find_sites</code> for each sequence and unpack the returned tuple into <code>sites, positions</code>. On the next line, we print the sequence name for the output. And then &#8230;</p>
<p>Yes, and then we have something new: <code>zip</code>. This function takes the sequences passed to it and returns a list of tuples, where the i-th tuple contains the i-th element of each argument. In our case we are passing two lists (sites and positions). We know that each site has one position, and because of the way we build both lists, the i-th element in sites corresponds to the i-th element in positions. Using <code>zip</code> we create n tuples where the i-th tuple holds the i-th element in sites and the i-th element in positions. Confused?</p>
<p>Let&#8217;s see. Imagine that sites is composed of</p>
<pre name="code" class="python">
[&#039;AAA&#039;, &#039;AAC&#039;, &#039;ACA&#039;, &#039;TAA&#039;]
</pre>
<p>and positions is composed of </p>
<pre name="code" class="python">
[300, 454, 23, 345]
</pre>
<p>and we want to output this</p>
<p><code>AAA -> 300<br />
AAC -> 454<br />
ACA -> 23<br />
TAA -> 345</code></p>
<p>The first idea that comes to mind is to write a loop over a range, as we already know that both lists have the same size. Something along the lines of</p>
<pre name="code" class="python">
for i in range(len(positions)):
    print sites[i], &#039;-&gt;&#039;, positions[i]
</pre>
<p>and it will work fine and it is not that long (codewise). <code>zip</code> might not be a big advantage here, but it will be somewhere else, for sure. We print the results and we are done. The full code will be posted in the repository soon. Next we will move to the book&#8217;s chapter 10 and start a new section here, looking at GenBank files.</p>
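<p>To compare the two approaches side by side, here is a small sketch. It is written for Python 3 (where <code>print</code> is a function and <code>zip</code> returns an iterator), unlike the Python 2 snippets in this post, and the helper names <code>pair_with_index</code> and <code>pair_with_zip</code> are made up for the example.</p>

```python
# Sample data mirroring the two lists shown above.
sites = ['AAA', 'AAC', 'ACA', 'TAA']
positions = [300, 454, 23, 345]


def pair_with_index(sites, positions):
    """Pair each site with its position using an index loop."""
    return [(sites[i], positions[i]) for i in range(len(positions))]


def pair_with_zip(sites, positions):
    """Pair each site with its position using zip."""
    return list(zip(sites, positions))


pairs = pair_with_zip(sites, positions)
for site, position in pairs:
    print(site, '->', position)
```

<p>Both helpers build the same list of tuples. With <code>zip</code> the loop can unpack each tuple into two names, so there is no index to keep track of.</p>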
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/09/13/restriction-enzymes-the-grand-finale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Restriction enzymes, third take</title>
		<link>http://python.genedrift.org/2007/09/13/restriction-enzymes-third-take/</link>
		<comments>http://python.genedrift.org/2007/09/13/restriction-enzymes-third-take/#comments</comments>
		<pubDate>Thu, 13 Sep 2007 17:08:42 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 6]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/09/13/restriction-enzymes-third-take/</guid>
		<description><![CDATA[We come to the penultimate part of our restriction enzyme site finder. Just a couple of pieces lacking in the puzzle and we are there. First, the most important: the function that searches for the sites, using regex patterns. We called it find_sites

def find_sites(input, set, sequence):
    iupacdict = {&#039;A&#039;:&#039;[A]&#039;,
    [...]]]></description>
			<content:encoded><![CDATA[<p>We come to the penultimate part of our restriction enzyme site finder. Just a couple of pieces lacking in the puzzle and we are there. First, the most important: the function that searches for the sites, using regex patterns. We called it <code>find_sites</code></p>
<pre name="code" class="python">
def find_sites(input, set, sequence):
    iupacdict = {&#039;A&#039;:&#039;[A]&#039;,
    &#039;C&#039;:&#039;[C]&#039;,
    &#039;G&#039;:&#039;[G]&#039;,
    &#039;T&#039;:&#039;[T]&#039;,
    &#039;M&#039;:&#039;[AC]&#039;,
    &#039;R&#039;:&#039;[AG]&#039;,
    &#039;W&#039;:&#039;[AT]&#039;,
    &#039;S&#039;:&#039;[CG]&#039;,
    &#039;Y&#039;:&#039;[CT]&#039;,
    &#039;K&#039;:&#039;[GT]&#039;,
    &#039;V&#039;:&#039;[ACG]&#039;,
    &#039;H&#039;:&#039;[ACT]&#039;,
    &#039;D&#039;:&#039;[AGT]&#039;,
    &#039;B&#039;:&#039;[CGT]&#039;,
    &#039;X&#039;:&#039;[ACGT]&#039;,
    &#039;N&#039;:&#039;[ACGT]&#039;}

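    #look up the recognition site of the selected enzyme
    site = set[input]
    pattern = &#039;&#039;
    positions = []
    #translate each IUPAC letter into its regex character class
    for i in site:
        pattern += iupacdict[i]
    searchpattern = re.compile(pattern)
    #findall returns the matched sites as a list of strings
    sites = searchpattern.findall(sequence)
    #finditer yields match objects; span() gives (start, end)
    temppos = searchpattern.finditer(sequence)
    for i in temppos:
        begin, end = i.span()
        positions.append(begin)

    return sites, positions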
</pre>
<p>We use the IUPAC dictionary created previously to translate the nucleotide codes of the sites in the restriction enzyme file. The function receives three values: the name of the selected enzyme, the dictionary with all the enzymes and sites, and the sequence to search. We could easily remove one of those, but let&#8217;s leave it there.</p>
<p>First we get the site from the dictionary and initialize an empty string to receive the pattern and an empty list to receive the positions. We will see why we don&#8217;t need an empty list to store the found sites. We then iterate over the site and create a pattern using the dictionary value for each letter of the site (the dictionary key). Once the pattern is created, we compile the regex and with <code>findall</code> we find every occurrence of the site in the sequence. As we have already seen, the regex <code>findall</code> generates a list with all the matches for that particular regex in the string we are searching. This is pretty handy because some enzymes have degenerate restriction sites. That&#8217;s why we don&#8217;t need a pre-initialized empty list for the sites.</p>
<p>Then we use <code>finditer</code> to find the exact position of each one of the sites. Each item returned by the iterator is a match object, and its <code>span()</code> method returns a tuple with the start and end positions. In our case we only need the start position, so in a small loop we iterate over the temporary variable and just append the start value to the empty positions list. We have two integers, <code>begin</code> and <code>end</code>, that receive the values from <code>i.span()</code>, but we only use <code>begin</code>.</p>
<p>The function then returns two lists as a tuple: one for the sites and one for the positions. If our programming is correct, both lists should have the same size and are ready to generate the output.</p>
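<p>As a rough sketch, here is how the same idea might look in Python 3. It is not the code from this series: it collects sites and positions in a single <code>finditer</code> pass instead of calling <code>findall</code> and <code>finditer</code> separately, it carries only a subset of the IUPAC dictionary, and the enzyme <code>ToyI</code> and the sequence are made up for the example.</p>

```python
import re

# Subset of the IUPAC dictionary from the post, enough for this example.
IUPAC = {'A': '[A]', 'C': '[C]', 'G': '[G]', 'T': '[T]',
         'R': '[AG]', 'Y': '[CT]', 'N': '[ACGT]'}


def find_sites(name, enzymes, sequence):
    """Return (sites, positions) for one enzyme in one sequence."""
    pattern = re.compile(''.join(IUPAC[letter] for letter in enzymes[name]))
    sites, positions = [], []
    # One pass with finditer gives both the matched text and its start.
    for match in pattern.finditer(sequence):
        sites.append(match.group())
        positions.append(match.start())
    return sites, positions


# Hypothetical enzyme with a degenerate site, and a toy sequence.
enzymes = {'ToyI': 'GRCGYC'}
sequence = 'TTGACGTCAAGGCGCCTT'
sites, positions = find_sites('ToyI', enzymes, sequence)
for site, position in zip(sites, positions):
    print(site, '->', position)
```

<p>Because both lists are filled inside the same loop, they always have the same length, which is the property the output code relies on.</p>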
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/09/13/restriction-enzymes-third-take/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Restriction enzymes: second take</title>
		<link>http://python.genedrift.org/2007/09/07/restrinction-enzymes-second-take/</link>
		<comments>http://python.genedrift.org/2007/09/07/restrinction-enzymes-second-take/#comments</comments>
		<pubDate>Fri, 07 Sep 2007 21:48:50 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 6]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/09/07/restrinction-enzymes-second-take/</guid>
		<description><![CDATA[We already have a function that reads the enzymes from a dataset in a flat file (with one change: return)

def read_enzymes(file):
    resenz = {}
    start = False
    for line in file:
        if line.find(&#039;Rich Roberts&#039;) &#62;= 0:
    [...]]]></description>
			<content:encoded><![CDATA[<p>We already have a function that reads the enzymes from a dataset in a flat file (with one change: return)</p>
<pre name="code" class="python">
def read_enzymes(file):
    resenz = {}
    start = False
    for line in file:
        if line.find(&#039;Rich Roberts&#039;) &gt;= 0:
            start = True
            line = file.next()
        if start == True and len(line) &gt; 10:
            buffer = line.split()
            resenz[buffer[0]] = buffer[-1].replace(&#039;^&#039;, &#039;&#039;)

    return resenz
</pre>
<p>We now need to write a function that searches for the sites and a main function that accepts the parameters, coordinates the search and returns the results. Looks like we are more than halfway there.</p>
<p>Passing parameters to a script was shown before, starting in Section 3. We import the <code>sys</code> module and use the list in <code>sys.argv</code> to get the parameters sent to the script. A basic skeleton of our main function would look like this</p>
<pre name="code" class="python">
import sys
import re
import fasta

#reading the enzyme dataset in one line and storing
#enzyme information in enzymeset
enzymeset = read_enzymes(open(&#039;bionet.709&#039;, &#039;r&#039;))

#storing enzyme name in a string
enzyme = sys.argv[1]
#reading a FASTA file and storing the sequences
sequence = fasta.get_seqs(open(sys.argv[2], &#039;r&#039;).readlines())
</pre>
<p>That&#8217;s a start. Now we have to write a function that checks whether the enzyme name entered by the user exists in our dataset. Something like this would work</p>
<pre name="code" class="python">
def check_enzyme(input, set):
    #True if the dictionary has the entered name as a key
    return input in set
</pre>
<p>This basically tests if the dictionary contains the name entered. If yes then we return True, otherwise False is returned. This changes our main script body</p>
<pre name="code" class="python">
import sys
import re
import fasta

#reading the enzyme dataset in one line and storing
#enzyme information in enzymeset
enzymeset = read_enzymes(open(&#039;bionet.709&#039;, &#039;r&#039;))

#storing enzyme name in a string
enzyme = sys.argv[1]
#check if the name entered is valid
isname = check_enzyme(enzyme, enzymeset)

#if it is valid, continue, otherwise abort
if isname:
    #reading a FASTA file and storing the sequences
    sequence = fasta.get_seqs(open(sys.argv[2], &#039;r&#039;).readlines())
else:
    print &#039;Name invalid. Please try again.&#039;
</pre>
<p>So, we have a good idea of what to do now. We just need a function that creates a regular expression and searches for it in the sequence. Next time &#8230;</p>
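<p>For anyone trying this on Python 3, the validation step could be sketched roughly as below. The function <code>parse_arguments</code>, the usage message and the sample dictionary are assumptions made for the example. They are not part of the script in this series, which reads <code>sys.argv</code> directly.</p>

```python
def check_enzyme(name, enzymes):
    """Return True when the enzyme name is a key of the dictionary."""
    return name in enzymes


def parse_arguments(argv):
    """Return (enzyme name, FASTA file name) from an argv-style list.

    Raises SystemExit with a usage message if an argument is missing.
    """
    if len(argv) != 3:
        raise SystemExit('usage: script.py ENZYME FASTAFILE')
    return argv[1], argv[2]


# Hypothetical data standing in for the parsed bionet file and sys.argv.
enzymeset = {'AacI': 'GGATCC', 'AagI': 'ATCGAT'}
enzyme, filename = parse_arguments(['script.py', 'AacI', 'seqs.fasta'])
isname = check_enzyme(enzyme, enzymeset)
print('Name found' if isname else 'Name invalid. Please try again.')
```

<p>Checking the number of arguments first means a missing file name produces a readable message instead of an <code>IndexError</code>.</p>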
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/09/07/restrinction-enzymes-second-take/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Planning</title>
		<link>http://python.genedrift.org/2007/09/07/planning/</link>
		<comments>http://python.genedrift.org/2007/09/07/planning/#comments</comments>
		<pubDate>Fri, 07 Sep 2007 20:51:43 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 6]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/09/07/planning/</guid>
		<description><![CDATA[Another aspect covered in the book that we haven&#8217;t seen yet is how to plan and design our script or software. Usual ways to design a program include writing use cases and drawing UML diagrams (UML stands for Unified Modeling Language). Here we will scratch the surface of use cases, where we will try to determine how [...]]]></description>
			<content:encoded><![CDATA[<p>Another aspect covered in the book that we haven&#8217;t seen yet is how to plan and design our script or software. Usual ways to design a program include writing use cases and drawing <a href="http://en.wikipedia.org/wiki/Unified_Modeling_Language">UML</a> diagrams (UML stands for Unified Modeling Language). Here we will scratch the surface of <a href="http://en.wikipedia.org/wiki/Use_case">use cases</a>, where we will try to determine how the program will interact with the user.</p>
<p>How to write use cases or design UML is not a subject taught in many biology courses, so we won&#8217;t see much of the theory here. Consider this a more informal way of planning a small script or application.</p>
<p>First thing would be to set a goal:</p>
<p><i>What is the main objective of our script?</i></p>
<p>Create a simple restriction enzyme map of certain sequences</p>
<p>Next, </p>
<p><i>what do we need to make the program work?</i></p>
<p>We need restriction enzyme information, such as names and sites and a sequence.</p>
<p>That leads us to</p>
<p><i>How do I store information?</i></p>
<p>Restriction enzyme data (last entry) obtained from a file can be stored in a dictionary, with the enzyme name as key and the site as value. Sequences are stored using our fasta class.</p>
<p>This brings us to one important issue</p>
<p><i>How to interact with the user?</i></p>
<p>The ideal way would be to present a list of enzymes for the user to select from, but we do not have a graphical interface to organize it nicely. So, we will ask the user to enter the name of an enzyme and the name of a file containing the sequence in which the enzyme site should be searched. We can do that interactively or by passing parameters to the script. We will do it by parameters here, but the interactive way can be tried as homework by those following the blog.</p>
<p>And finally</p>
<p><i>What about an output?</i></p>
<p>We will do the same way as the book: a list of positions, indicating the restriction sites. I welcome other options in the comments.</p>
<p>It looks like we have a plan. Let&#8217;s gather what we already have and our previous knowledge, and meet on the other side of this post.</p>
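<p>As a hint for the homework on interactive input, the two ways of getting the enzyme name and file name could be combined roughly like this. The sketch is written for Python 3, and the function <code>get_inputs</code> with its <code>ask</code> parameter is a made-up helper, not something defined elsewhere in this series.</p>

```python
def get_inputs(argv, ask=input):
    """Return (enzyme, filename).

    Both values come from argv when they are given as parameters;
    otherwise the user is asked for them interactively.
    """
    if len(argv) == 3:
        return argv[1], argv[2]
    enzyme = ask('Enzyme name: ')
    filename = ask('FASTA file name: ')
    return enzyme, filename


# Parameter route, with a hypothetical argv list standing in for sys.argv.
print(get_inputs(['script.py', 'EcoRI', 'seqs.fasta']))
```

<p>Passing the prompt function in as <code>ask</code> keeps the interactive route easy to try out without typing anything, since any function that returns a string can stand in for <code>input</code>.</p>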
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/09/07/planning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Restriction enzymes: first take</title>
		<link>http://python.genedrift.org/2007/08/31/restriction-enzymes-first-take/</link>
		<comments>http://python.genedrift.org/2007/08/31/restriction-enzymes-first-take/#comments</comments>
		<pubDate>Fri, 31 Aug 2007 21:08:12 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 6]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/08/31/restriction-enzymes-first-take/</guid>
		<description><![CDATA[We now jump with both feet on the main topic of the book&#8217;s chapter, which is generating restriction maps of DNA sequences. First step is to obtain restriction enzyme information, read it and format in a way that our main script will understand. We will use the same dataset as the book, the Rebase database [...]]]></description>
			<content:encoded><![CDATA[<p>We now jump with both feet on the main topic of the book&#8217;s chapter, which is generating restriction maps of DNA sequences. The first step is to obtain restriction enzyme information, read it and format it in a way that our main script will understand. We will use the same dataset as the book, the <a href="http://rebase.neb.com/rebase/rebase.html">Rebase</a> database at New England Biolabs. In the book, Tisdall suggests using the <em>bionet</em> format, which can be downloaded <a href="ftp://ftp.neb.com/pub/rebase/">here</a> (scroll down to bionet.709).</p>
<p>This file looks like this:</p>
<p><code>AaaI (XmaIII)                     C^GGCCG<br />
AacI (BamHI)                      GGATCC<br />
AaeI (BamHI)                      GGATCC<br />
AagI (ClaI)                       AT^CGAT<br />
AanI (PsiI)                       TTA^TAA<br />
AaqI (ApaLI)                      GTGCAC</code></p>
<p>where the first column contains the enzyme names and the second column has the actual cleavage sites. It won&#8217;t be difficult to parse this file and create a dictionary with the names and sites. It would be easier if the data were tab-separated, but this is nothing we will have a problem dealing with. The file also has some header lines which we will have to skip.</p>
<p>The next entry will deal with use cases and some other aspects of coding, as we head to our most complex script so far. For now we will only create a function that reads the file and generates the dictionary, which sounds very simple.</p>
<p>There are two ways (maybe even more) to discard the header lines: programmatically, or by manually deleting such lines from the file. We won&#8217;t delete the lines this time, so we have to follow the other path. The easiest way to skip the header lines in our parsing function would be to count the lines and start parsing after a fixed number of them, but that would require high confidence that the format stays constant in every release. Instead we look for a line of text that marks the end of the header.</p>
<p>So, first we get rid of the header lines</p>
<pre name="code" class="python">
def read_enzymes(file):
    resenz = {}
    start = False
    for line in file:
        if line.find(&#039;Rich Roberts&#039;) &gt;= 0:
            start = True
            line = file.next()
        if start == True and len(line) &gt; 10:
</pre>
<p>where we already declared the dictionary that will receive the names and sites. We have a boolean flag (<code>start</code>) that tells the script where the actual enzyme list starts. It is declared as false and then set to true when the line containing <code>Rich Roberts</code> is found. The line <code>line = file.next()</code> advances the file by one line and stores that next line in <code>line</code>. We do this to avoid starting the parsing of the file at the line where we found <code>Rich Roberts</code>. The second <code>if</code> statement checks the line length in order to split and parse only the actual entries and discard empty lines.</p>
<p>Now, in order to get the sites and enzyme names we will use <code>split</code>, differently than we used it before. This time we won&#8217;t pass any argument to <code>split</code> as a separator. This triggers another type of splitting procedure, where leading and trailing whitespace (spaces, tabs, newlines, returns, and formfeeds) is stripped from the string and the remaining words are separated on runs of whitespace of arbitrary length. Basically this split will return a list with the words in that particular line.</p>
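<p>A quick way to see what <code>split</code> without arguments does is to try it on one line of the file. The snippet below is a small Python 3 illustration using the first entry shown earlier in this post.</p>

```python
# One line in the bionet layout shown earlier, including its newline.
line = 'AaaI (XmaIII)                     C^GGCCG\n'

# No separator given: split on runs of whitespace, dropping the newline.
words = line.split()
print(words)

# The enzyme name is the first word and the site is the last one.
name, site = words[0], words[-1]
print(name, site.replace('^', ''))
```

<p>Taking the first and last words means the optional name in parentheses never gets in the way, whether it is present or not.</p>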
<p>Inspecting the file carefully, we notice that some enzyme names have an extra id between parentheses. We will not consider this extra id. After splitting we will get the first and last elements of the list and put them in our dictionary.</p>
<pre name="code" class="python">
buffer = line.split()
resenz[buffer[0]] = buffer[-1].replace(&#039;^&#039;, &#039;&#039;)
</pre>
<p>Done. We have the dictionary ready. Using Python&#8217;s included batteries, the <code>replace</code> call in the last line removes the circumflex characters (<code>^</code>) from the site. Putting everything together we have</p>
<pre name="code" class="python">
def read_enzymes(file):
    resenz = {}
    start = False
    for line in file:
        if line.find(&#039;Rich Roberts&#039;) &gt;= 0:
            start = True
            line = file.next()
        if start == True and len(line) &gt; 10:
            buffer = line.split()
            resenz[buffer[0]] = buffer[-1].replace(&#039;^&#039;, &#039;&#039;)
</pre>
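<p>For readers on Python 3, where file objects no longer have a <code>next()</code> method, a rough equivalent could look like the sketch below. It takes any iterable of lines, skips the marker line with <code>continue</code> instead of advancing the file by hand, and ends with a <code>return</code> so the caller receives the dictionary. The miniature <code>sample</code> list is made up to stand in for the real bionet file.</p>

```python
def read_enzymes(lines):
    """Build {enzyme name: site} from lines in the bionet layout.

    Parsing starts after the line that contains 'Rich Roberts'.
    """
    resenz = {}
    start = False
    for line in lines:
        if 'Rich Roberts' in line:
            start = True
            continue  # skip the marker line itself
        if start and len(line) > 10:
            fields = line.split()
            resenz[fields[0]] = fields[-1].replace('^', '')
    return resenz


# Hypothetical miniature file: header lines, a blank line, three entries.
sample = [
    'REBASE header text\n',
    'Rich Roberts\n',
    '\n',
    'AaaI (XmaIII)                     C^GGCCG\n',
    'AacI (BamHI)                      GGATCC\n',
    'AagI (ClaI)                       AT^CGAT\n',
]
enzymes = read_enzymes(sample)
print(enzymes)
```

<p>Since the function only iterates over its argument, it works the same with an open file as with a list of strings.</p>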
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/08/31/restriction-enzymes-first-take/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Finding motifs: IUPAC and RegEx, an approach</title>
		<link>http://python.genedrift.org/2007/08/28/finding-motifs-iupac-and-regex-an-approach/</link>
		<comments>http://python.genedrift.org/2007/08/28/finding-motifs-iupac-and-regex-an-approach/#comments</comments>
		<pubDate>Tue, 28 Aug 2007 22:21:33 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 6]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/08/28/finding-motifs-iupac-and-regex-an-approach/</guid>
		<description><![CDATA[After a long delay, we are back. Before entering the next topic, Restriction Enzymes, let&#8217;s take a look at how to create a regex pattern from user input and the dictionary of IUPAC codes for nucleotides. We will use the same dictionary from the previous entry

iupacdict = {&#039;M&#039;:&#039;[AC]&#039;,
	&#039;R&#039;:&#039;[AG]&#039;,
	&#039;W&#039;:&#039;[AT]&#039;,
	&#039;S&#039;:&#039;[CG]&#039;,
	&#039;Y&#039;:&#039;[CT]&#039;,
	&#039;K&#039;:&#039;[GT]&#039;,
	&#039;V&#039;:&#039;[ACG]&#039;,
	&#039;H&#039;:&#039;[ACT]&#039;,
	&#039;D&#039;:&#039;[AGT]&#039;,
	&#039;B&#039;:&#039;[CGT]&#039;,
	&#039;X&#039;:&#039;[ACGT]&#039;,
	&#039;N&#039;:&#039;[ACGT]&#039;}

and the same consensus sequence of the GATA3 [...]]]></description>
			<content:encoded><![CDATA[<p>After a long delay, we are back. Before entering the next topic, Restriction Enzymes, let&#8217;s take a look at how to create a regex pattern from user input and the dictionary of IUPAC codes for nucleotides. We will use the same dictionary from the previous entry</p>
<pre name="code" class="python">
iupacdict = {&#039;M&#039;:&#039;[AC]&#039;,
	&#039;R&#039;:&#039;[AG]&#039;,
	&#039;W&#039;:&#039;[AT]&#039;,
	&#039;S&#039;:&#039;[CG]&#039;,
	&#039;Y&#039;:&#039;[CT]&#039;,
	&#039;K&#039;:&#039;[GT]&#039;,
	&#039;V&#039;:&#039;[ACG]&#039;,
	&#039;H&#039;:&#039;[ACT]&#039;,
	&#039;D&#039;:&#039;[AGT]&#039;,
	&#039;B&#039;:&#039;[CGT]&#039;,
	&#039;X&#039;:&#039;[ACGT]&#039;,
	&#039;N&#039;:&#039;[ACGT]&#039;}
</pre>
<p>and the same consensus sequence of the GATA3 binding site</p>
<p>NNGATARNG</p>
<p>This consensus sequence will be provided as a script parameter, along with the filename</p>
<pre name="code" class="python">
import sys
import re
sequencefile = open(sys.argv[1], &#039;r&#039;).readlines()
motif = sys.argv[2]
</pre>
<p>The third line reads the sequence file given by the user and the fourth line stores the input motif in a string. Now we have to take the motif string, check each letter (IUPAC code) and get the corresponding set of nucleotides. For this we will use a loop and the <code>get</code> method of Python dictionaries. This method, as its name implies, gets the <code>value</code> of the <code>key</code> in parentheses, or <code>None</code> when the key is not in the dictionary. Like this</p>
<pre name="code" class="python">
for n in motif:
    print iupacdict.get(n)
</pre>
<p>If we combine both excerpts above and run them using the GATA3 model as input, the only values found would be</p>
<pre name="code" class="python">
[ACGT]
[ACGT]
[AG]
[ACGT]
</pre>
<p>which is five nucleotides short of the motif length: for the plain nucleotides <code>get</code> returns <code>None</code>, because they are not keys of the dictionary. How to correct it? Pretty simple: we just add the "regular" nucleotide codes to the dictionary</p>
<pre name="code" class="python">
iupacdict = {&#039;A&#039;:&#039;A&#039;,
	&#039;C&#039;:&#039;C&#039;,
	&#039;G&#039;:&#039;G&#039;,
	&#039;T&#039;:&#039;T&#039;,
	&#039;M&#039;:&#039;[AC]&#039;,
	&#039;R&#039;:&#039;[AG]&#039;,
	&#039;W&#039;:&#039;[AT]&#039;,
	&#039;S&#039;:&#039;[CG]&#039;,
	&#039;Y&#039;:&#039;[CT]&#039;,
	&#039;K&#039;:&#039;[GT]&#039;,
	&#039;V&#039;:&#039;[ACG]&#039;,
	&#039;H&#039;:&#039;[ACT]&#039;,
	&#039;D&#039;:&#039;[AGT]&#039;,
	&#039;B&#039;:&#039;[CGT]&#039;,
	&#039;X&#039;:&#039;[ACGT]&#039;,
	&#039;N&#039;:&#039;[ACGT]&#039;}
</pre>
<p>This would solve our "problem" without adding a line of code to the final script. Fast-forwarding to creating the regex</p>
<pre name="code" class="python">
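mregex = &#039;&#039;
for n in motif:
    mregex += iupacdict.get(n)

print mregex #just to check

#findall needs one string: join the sequence lines, skipping any FASTA headers
sequence = &#039;&#039;.join([l.strip() for l in sequencefile if not l.startswith(&#039;&gt;&#039;)])
tosearch = re.compile(mregex)
for i in tosearch.findall(sequence):
    print i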
</pre>
<p>Simple and quick. The final output is not very elaborate, but we can improve it. Next time.</p>
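<p>One possible improvement, sketched here in Python 3, is to report where each match starts. The helper names <code>motif_to_regex</code> and <code>join_fasta_lines</code> and the toy FASTA lines are assumptions made for this example, and the sketch assumes a file with a single sequence.</p>

```python
import re

IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
         'M': '[AC]', 'R': '[AG]', 'W': '[AT]', 'S': '[CG]',
         'Y': '[CT]', 'K': '[GT]', 'V': '[ACG]', 'H': '[ACT]',
         'D': '[AGT]', 'B': '[CGT]', 'X': '[ACGT]', 'N': '[ACGT]'}


def motif_to_regex(motif):
    """Translate an IUPAC consensus string into a regex pattern."""
    return ''.join(IUPAC[letter] for letter in motif.upper())


def join_fasta_lines(lines):
    """Join the sequence lines of a single-record FASTA file,
    dropping the header line(s) that start with '>'."""
    return ''.join(l.strip() for l in lines if not l.startswith('>'))


# Hypothetical FASTA content; the motif is split across two lines.
lines = ['>toy sequence\n', 'CCAAGAT\n', 'AAGGTT\n']
sequence = join_fasta_lines(lines)
pattern = re.compile(motif_to_regex('NNGATARNG'))
hits = [(m.start(), m.group()) for m in pattern.finditer(sequence)]
print(hits)
```

<p>Joining the lines before searching matters: a motif that straddles a line break in the file would be missed if each line were searched on its own.</p>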
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/08/28/finding-motifs-iupac-and-regex-an-approach/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Finding motifs: IUPAC and RegEx, an overview</title>
		<link>http://python.genedrift.org/2007/07/16/finding-motifs-iupac-and-regex-an-overview/</link>
		<comments>http://python.genedrift.org/2007/07/16/finding-motifs-iupac-and-regex-an-overview/#comments</comments>
		<pubDate>Mon, 16 Jul 2007 16:59:56 +0000</pubDate>
		<dc:creator>Paulo Nuin</dc:creator>
				<category><![CDATA[Section 6]]></category>

		<guid isPermaLink="false">http://python.genedrift.org/2007/07/16/finding-motifs-iupac-and-regex-an-overview/</guid>
		<description><![CDATA[End of Section 5, moving to Section 6. For anyone also following the book there will be a jump at the end of chapter 8, so we are not touching the final script that deals with different reading frames here. We are going straight, or almost, to another take on Regular Expressions. We are going [...]]]></description>
			<content:encoded><![CDATA[<p>End of Section 5, moving to Section 6. For anyone also following the book there will be a jump at the end of chapter 8, so we are not touching the final script that deals with different reading frames here. We are going straight, or almost, to another take on Regular Expressions. We are going to check some aspects of restriction enzymes, but first we are going to touch base with motif finding in DNA sequences.</p>
<p>We already saw how to use the <code>re</code> module and to do some simple regular expression searches. Basically our motif search was very simple, with run-time user input using </p>
<pre name="code" class="python">
motif = re.compile(&#039;%s&#039; % inmotif)
</pre>
<p>where <code>inmotif</code> was the short sequence string entered. Enhancing this a little bit, we will first create a more advanced regex search using mismatches in the sequences, and afterwards we will see (as in the book) how to translate IUPAC codes to regex. The IUPAC table is</p>
<p>A &#8211; Adenine<br />
C &#8211; Cytosine<br />
G &#8211; Guanine<br />
T &#8211; Thymine<br />
U &#8211; Uracil<br />
M &#8211; A or C<br />
R &#8211; A or G<br />
W &#8211; A or T<br />
S &#8211; C or G<br />
Y &#8211; C or T<br />
K &#8211; G or T<br />
V &#8211; A or C or G<br />
H &#8211; A or C or T<br />
D &#8211; A or G or T<br />
B &#8211; C or G or T<br />
X &#8211; A or C or G or T<br />
N &#8211; A or C or G or T</p>
<p>We are going to use basic Python to do this. There are more advanced ways to reach the same goals, but as we haven&#8217;t seen a lot of dictionaries it would be great to check them again.</p>
<p>The first idea for the table above is to use a dictionary for the codes that represent two or more nucleotides, as dictionary values can hold the regex element that corresponds to each code. Our basic dictionary would look like this</p>
<pre name="code" class="python">
iupacdict = {&#039;M&#039;:&#039;[AC]&#039;,
	&#039;R&#039;:&#039;[AG]&#039;,
	&#039;W&#039;:&#039;[AT]&#039;,
	&#039;S&#039;:&#039;[CG]&#039;,
	&#039;Y&#039;:&#039;[CT]&#039;,
	&#039;K&#039;:&#039;[GT]&#039;,
	&#039;V&#039;:&#039;[ACG]&#039;,
	&#039;H&#039;:&#039;[ACT]&#039;,
	&#039;D&#039;:&#039;[AGT]&#039;,
	&#039;B&#039;:&#039;[CGT]&#039;,
	&#039;X&#039;:&#039;[ACGT]&#039;,
	&#039;N&#039;:&#039;[ACGT]&#039;}
</pre>
<p>We include the square brackets around each value because these brackets inside a regex define a character class, which matches any one of the characters listed inside. This dictionary will be really useful for restriction enzymes too.</p>
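<p>To make the role of the brackets concrete, here is a tiny Python 3 illustration. The test string is made up for the example.</p>

```python
import re

# 'R' stands for A or G, so its regex equivalent is the class [AG].
purine = re.compile('[AG]')

# Check each nucleotide against the class on its own.
for base in 'ACGT':
    print(base, bool(purine.fullmatch(base)))

# A class matches exactly one character, so 'GATA[AG]' matches
# GATAA and GATAG but not GATAC.
matches = re.findall('GATA[AG]', 'GATAAGATACGATAG')
print(matches)
```

<p>Each bracketed class stands for a single position in the sequence, which is why one dictionary value per IUPAC letter is enough.</p>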
<p>Now in order to find a motif we need to create a regular expression from the consensus sequence of the motif model, which can be for instance a transcription factor binding site. Let&#8217;s say we have a GATA3 binding site with the consensus sequence</p>
<p>NNGATARNG</p>
<p>which would give us the regular expression</p>
<pre name="code" class="python">
motif = re.compile(&#039;[ACGT][ACGT]GATA[AG][ACGT]G&#039;)
</pre>
<p>that can be derived directly from the <code>iupacdict</code>, leaving the plain nucleotides as they are. Next time, we will check how to create these regexes on the fly, from input and the dictionary.</p>
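<p>As a preview, the derivation can be checked with a few lines of Python 3. This is only a sketch: it assumes the plain nucleotides are added to the dictionary mapped to themselves, which the dictionary in this post does not do, and the test sequence in the last line is made up.</p>

```python
import re

# Ambiguity codes from the dictionary in this post.
iupacdict = {'M': '[AC]', 'R': '[AG]', 'W': '[AT]', 'S': '[CG]',
             'Y': '[CT]', 'K': '[GT]', 'V': '[ACG]', 'H': '[ACT]',
             'D': '[AGT]', 'B': '[CGT]', 'X': '[ACGT]', 'N': '[ACGT]'}
# Assumption for this sketch: plain nucleotides map to themselves.
iupacdict.update({base: base for base in 'ACGT'})

consensus = 'NNGATARNG'
generated = ''.join(iupacdict[letter] for letter in consensus)
handwritten = '[ACGT][ACGT]GATA[AG][ACGT]G'
print(generated == handwritten)

motif = re.compile(generated)
print(bool(motif.search('TTGATAGCG')))
```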
]]></content:encoded>
			<wfw:commentRss>http://python.genedrift.org/2007/07/16/finding-motifs-iupac-and-regex-an-overview/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

