Finding motifs: IUPAC and RegEx, an approach
Section 6 August 28th, 2007After a long delay, we are back. Before entering in the next topic, Restriction Enzymes, let’s take a look how to create a regex pattern from user input and the dictionary of IUPAC code for nucleotides. We will use the same dictionary from the previous entry
iupacdict = {'M':'[AC]',
'R':'[AG]',
'W':'[AT]',
'S':'[CG]',
'Y':'[CT]',
'K':'[GT]',
'V':'[ACG]',
'H':'[ACT]',
'D':'[AGT]',
'B':'[CGT]',
'X':'[ACGT]',
'N':'[ACGT]'}
and the same consensus sequence of the GATA3 binding site
NNGATARNG
This consensus sequence will be provided as a script parameter, along with the filename
import sys import re sequencefile = open(sys.argv[1], 'r').readlines() motif = sys.argv[2]
First line reads the sequence file from the user input and second line stores the input motif in a string. Now we have to get the motif string and check for each letter (IUPAC code) and get the correspondent set of nucleotides. For this we will use a loop and a the method get method of Python dictionaries. This method, as its name implies, gets the value<code> of the <code>key in parentheses. Like this
for n in motif:
iupacdict.get(n)
If we combine both excerpts above and run using as input the GATA3 model the result would like
[ACGT] [ACGT] [AG] [ACGT]
which is five nucleotides short of the motif length. How to correct it? Pretty simple we just add to the dictionary the “regular” nucleotide codes
iupacdict = {'A':'A',
'C':'C',
'G':'G',
'T':'T',
'M':'[AC]',
'R':'[AG]',
'W':'[AT]',
'S':'[CG]',
'Y':'[CT]',
'K':'[GT]',
'V':'[ACG]',
'H':'[ACT]',
'D':'[AGT]',
'B':'[CGT]',
'X':'[ACGT]',
'N':'[ACGT]'}
This would solve our “problem” without adding a line of code to the final script. Fastforwarding to creating the regex
mregex = ''
for n in motif:
mregex += iupacdict.get(n)
print mregex #just to check
tosearch = re.compile(str(mregex))
for i in tosearch.findall(sequencefile):
print i
Simple and quick. The final output is not really elaborated, but we can improve it. Next time.
November 1st, 2007 at 4:08 pm
[...] in a FASTA file (again, this is a very simple example, just a primer). Of course we can use another method, but this time we want to use a functional programming approach. filter suits us best here. Again [...]