Mar 14

Closing section two, let’s use everything we saw before and write a nice script that will read a sequence file (DNA) and report us of any “errors” and the number of different nucleotides. It is very simple, but a good exercise. Depending on your needs you can easily modify it to check for other characteristics of sequences, even change it to read amino acids sequences. We start with the code, comments coming after it.

#!/usr/bin/env python

import sys
import re

fileentered = True
while fileentered == True:
    filename = raw_input('Please enter a file to check: ')
    if len(filename) >= 1:
        try:
            seqlist = open(filename, 'r').readlines()
	    sequence = ''.join(seqlist)
	    sequence = sequence.replace('\n', '')
            totalA = sequence.count('A')
            totalC = sequence.count('C')
            totalG = sequence.count('G')
            totalT = sequence.count('T')
            otherletter = re.compile('[BDEFHIJKLMNOPQRSUVXZ]')
	    extra = re.findall(otherletter, sequence)
	    output = open(filename+'.count', 'w')
	    output.write('Count report for file ' + filename + '\n')
	    output.write('A = ' + str(totalA) + '\n')
	    output.write('C = ' + str(totalC) + '\n')
	    output.write('G = ' + str(totalG) + '\n')
	    output.write('T = ' + str(totalT) + '\n')
	    if len(extra) > 0:
	    	output.write('Also were found ' + str(len(extra)) + ' errors\n')
	    	for i in extra:
		    output.write(i + ' ')
	    else:
		output.write('No error found')
            output.close()
	    print 'Result file saved on ' + filename + '.count'
        except:
            print 'File not found. Please try again.'
    else:
        sys.exit()

The main change here is that we use a while loop to control de program flow. Notice that we import sys and re. Basically we ask for an user input, the filename, and depending on the input given we process the file or exit the program. The early exit is done with the sys.exit method which is a shortcut to get out of the script processing. Very handy. If the input is valid we try to open it. If the operation is successful, great, we read the file, count the nucleotides and use a quite scary regular expression to search all the “errors” in our sequence.

The regex is compiled with the pattern '[BDEFHIJKLMNOPQRSUVXZ]' which means “match any character in this range”. Take a closer look at the pattern and you will see that is every letter in the alphabet, except for ACGT. And when searching for these “errors” instead of using re.search we use re.findall, which conveniently returns a list with all the substrings found. That’s even more handy.

On the final part of the script we take care of the output, opening a file called <whatever>.count where we print the counts and the errors, if they actually exist. One thing I left from the previous post, is that we need to close the file opened to write. In some cases if the file is not properly closed, errors might occur. So always remember to close the files used for writing/appending, using <filename>.close().

Mar 13

We already know how to read from files, now we are going to see how to write to them. We will start with a simple example, writing some content to a “fresh” file that does not exist in the system.

Sometime ago, we have seen the print statement in Python, that prints to the system standard output (usually the screen). This output can be redirected using > to a stream/file. But if we are going to creat really professional applications (even to our own use), usyally stream redirection is not really the nicest approach. In some cases the best alternative is to save a file.

Also, some posts ago, we covered the methodology to open a file. We are going to use here the same command, open to (in our case) create the file. Remember that to read the file we used

file = open(dnafile, 'r')

where the 'r' is the mode we are using to open the file. In that case we used the read mode, now we are going to use the write mode. When we open a file with this command in write mode if the file (with the name with pass) does not exist it is created. If it does exist, all the contents of the current file will be erased, so be careful when opening a file from your script. Check for the location, file name, etc before opening the file.

So, basically we have

file = open(output, 'w')

and the file is open and ready to receive data.

Let’s cheat and use the previous script that counts nucleotides and modify it to save a count.txt file wit the results:

#!/usr/bin/env python

#let's keep the file fixed for now
dnafile = "AY162388.seq"
#opening the file, reading the sequence and storing in a list
file = open(dnafile, 'r')
#initialize a string to receive the data
sequence = ''
for line in file:
    sequence += line.strip() #notice the strip, to remove n
#"exploding" the sequence in a list
seqlist = list(sequence)
#initializing integers to store the counts
totalA = 0
totalC = 0
totalG = 0
totalT = 0
#checking each item in the list and updating counts
for base in seqlist:
    if base == 'A':
        totalA += 1
    elif base == 'C':
        totalC += 1
    elif base == 'G':
        totalG += 1
    elif base == 'T':
        totalT += 1
#printing results
resultfile = open('counts.txt', 'w')
resultfile.write(str(totalA) + ' As found \n')
resultfile.write(str(totalC) + ' Cs found \n')
resultfile.write(str(totalG) + ' Gs found \n')
resultfile.write(str(totalT) + ' Ts found \n')

The only difference is at the end of the script. Here instead of print we use write. Notice that write is a method of the opened file.(in our case called resultfile. This method only accepts strings, so we need to convert everything to string before writing in the file. Notice also that we need to add a carriage return/newline at the end of the string to be written. Differently of print, write does not automatically puts a new line at the end of the output.

Mar 13

As I have not updated the website in a couple of weeks, I will try to catch up with the book instead of revisiting (for the time being) our past scripts. I am going to finish the book’s chapter 5 (our section 2) in the next post, where I will give a small introduction on how to output data to a file in Python. On this post we will check some of the methods that can be used to manipulate strings. The book gives only a couple of methods to be used in perl on string, but here I will show a longer list of Python methods that can be used on its immutable strings.

A full list of the methods can be found here and I will will give brief explanantions on the ones I think are key for bioinformatics. Remember that string cannot be changed in Python, so we will always going to use a buffer/temp variable to store our changed string when needed.

- count this method returns the number of times you see a substring (a letter/number, a word, etc) in another string. You can also specify a start and an end positions to look for. This is handy if you are counting nucleotides/aaminoacids in a sequence. This method returns an integer

- endswith this method checks the end of your string for a determined substring. Very handy of you need to check the tail end of your sequence right away.

- find returns the position of the substring being searched, and -1 if it is not found. It is similar to the re.search we used before. This is very useful if you are looking for a determined motif/subsequence in a hurry.

- islower returns a true bool (true or false) if all characters in the string are lowercase. False if at least one of the characters is uppercase. Very handy if you need to convert lowercase to UPPERCASE files for input in some application. Works in conjunction with the isupper which is basically the opposite.

- join. We have seen this before: it concatenates strings using a determined separator.

- lower and upper, that as their name might indicate return the string converted to lowercase/uppercase. This is also good for the conversion of sequence format in input files.

I will stop here. There are many other methods that can be used. Some of the above were already covered here and in the next posts we will take a look at the other ones, creating an application that actually performs some useful function.

Feb 22

Let’s check the “short” way, that is basically a method that avoid the “explosion” of the string. Instead of transforming the sequence from the file from a string to a list, we go and use the string directly, applying one of the methods available to manipulate strings. This time we read the file at once and convert the list to a string using join. To count we simply use the method count on our string. Pretty nice.

#!/usr/bin/env python
#still keep the file fixed
dnafile = "AY162388.seq"
#opening the file, reading the sequence and storing in a list
seqlist = open(dnafile, 'r').readlines()
#let's join the the lines in a temporary string
temp = ''.join(seqlist)
#counting
totalA = temp.count('A')
totalC = temp.count('C')
totalG = temp.count('G')
totalT = temp.count('T')
#printing results
print str(totalA) + ' As found'
print str(totalC) + ' Cs found'
print str(totalG) + ' Gs found'
print str(totalT) + ' Ts found'

That’s the short way: using count. This is a method that when applied on a string counts the number of times the substring appears in our string. We can even set a start and ending point to count. Notice one difference in this script to the previous examples: after we join the items of the list into a string we do not remove the carriage returns. Because in this case we don’t need it, as it will be an extra character there that won’t make any harm.
Download the above code from the repository.

Feb 22

We are going to jump the regular expression explanation that you can find in the book, and we will go directly to the section that introduces string/list/array manipulation and adapt it to Python.

We are going to see two different methods: a “long” and a “short” one. In many places and computer languages you will see that there are different ways of doing the same thing, with advantages and disadvantages. It is up to you to define which methods are better or worse, as this is a very personal matter. Some people prefer the longer way because it might be clearer and easier to understand, or it might be necessary to use it due to code maintainability. Other people would rather use the shorter path because they want to generate code faster.

It does not matter the path you select, as long as you get your task done. That’s the key: focus on the end product not on how exactly got there. If you used 10 lines of code more, or 10 less, that’s irrelevant as long as you did what you wanted. Live and learn and someday you will use the even shorter way.

So we start with the long way. We want to count the individual number of nucleotides in a sequence, determining they relative frequency. Remember that we read the lines of a sequence file into a list, using readlines. But on the long method we will read the file and store the data into a string like

file = open(filename, 'r')
sequence = ''
for line in file
    sequence += line

This way it will be easier to “explode” the sequence in separated items. After the “explosion” we can check each item in the list and get our result. The code is below, I will be back after it.

#!/usr/bin/env python
#let's keep the file fixed for now
dnafile = "AY162388.seq"
#opening the file, reading the sequence and storing in a list
file = open(dnafile, 'r')
#initialize a string to receive the data
sequence = ''
for line in file:
    sequence += line.strip() #notice the strip, to remove \n
#"exploding" the sequence in a list
seqlist = list(sequence)
#initializing integers to store the counts
totalA = 0
totalC = 0
totalG = 0
totalT = 0
#checking each item in the list and updating counts
for base in seqlist:
    if base == 'A':
        totalA += 1
    elif base == 'C':
        totalC += 1
    elif base == 'G':
        totalG += 1
    elif base == 'T':
        totalT += 1
#printing results
print str(totalA) + ' As found'
print str(totalC) + ' Cs found'
print str(totalG) + ' Gs found'
print str(totalT) + ' Ts found'

Most of the above code was already covered before. Let’s look at the different stuff, like the “explosion line”

seqlist = list(sequence)

Here we basically transform our string sequence into a list, by putting the object type we want before the object we want converted, like we do here

print str(totalA) + ' As found'

This construction type_to_convert(to_be_converted) tells Python to get whatever is inside the parentheses and transform into whatever is outside. Converting the string to a list will get

GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGT
GCGAAGGTAGCGTAATCACTTGTTCTTTAAATAAGGACTAGTATG
AATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATCAGTGAA
...

and transform it into

['G', 'T', 'G', 'A', 'C', 'T', 'T', 'T', 'G', 'T', 'T', 'C', 'A', 'A', 'C', 'G', 'G', 'C', 'C', 'G', 'C', 'G',
 'G', 'T', 'A', 'T', 'C', 'C', 'T', 'A', 'A', 'C', 'C', 'G', 'T', 'G', 'C', 'G', 'A', 'A', 'G', 'G', 'T', 'A', 'G', 'C', 'G', 'T',
'A', 'A', 'T', 'C', 'A', 'C', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'T', 'T', 'A', 'A', 'A', 'T', 'A', 'A', 'G', 'G', 'A', 'C', 'T', 'A',
 'G', 'T', 'A', 'T', 'G', 'A', 'A', 'T', 'G', 'G', 'C', 'A', 'T',...]

After “exploding”, we use a for loop to iterate over every item in the list and use conditional statements to do the counts. Regarding the counts, we use this operator

totalT += 1

that, in C/C++, tells the interpreter to get the value of totalT and add 1 to it.

On the next post, we will see the short way and further modify the above script, that can be downloaded from the repository.

Feb 22

As you might have noticed from the previous posts, comments in Python are defined mainly by the # sign. This is the signal used for single line comment like

#this is a single line comment

If you are used to C++, this would be equivalent to //.

On the other hand, multi line comments are defined by triple double quotes """, opening and closing, similar to C++ /* ... */, like this

"""this is a multi
line comment"""

Just a note. Resuming bioinformatics mode.

Feb 20

As promised, let’s change a bit our previous code, and make it more effective. If you are reading this tutorial in one-entry mode, let’s check the code

#!/usr/bin/env python
import re
dnafile = "AY162388.seq"
seqlist = open(dnafile, 'r').readlines()
temp = ''.join(seqlist)
sequence = temp.replace('\n', '')
inputfromuser = True
while inputfromuser:
    inmotif = raw_input('Enter motif to search: ')
    if len(inmotif) >= 1:
        motif = re.compile('%s' % inmotif)
        if re.search(motif, sequence):
            print 'Yep, I found it'
        else:
            print 'Sorry, try another one'
    else:
        print 'Done, thanks for using motif_search'
        inputfromuser = False

In order to make it more effective, let’s allow the input of any file, maybe asking for the file as soon as the script is started. To accomplish that we need to “protect” the script and make it error proof (almost at least). Yes, you thought it right: we need to check if the input file exists before opening it.

We can use try ... except statements to do that. Pretty handy. This is called an exception handler, so basically we try the validity of some command/method and depending on the result we continue our program flow or we catch the exception and do something else. Under try we put the expression to evaluate, and under excepts what to do in case of failure. Like this

try:
    opening a file
except:
    as the file does not exist, print something

So, let’s put in our code above. We remove this

dnafile = "AY162388.seq"
seqlist = open(dnafile, 'r').readlines()

and include this lines

fileinput = True
while fileinput == True:
    filename = raw_input('Enter file name:')
    if len(filename) > 0:
        try:
            dnafile = open(filename, 'r')
            fileinput = False
        except:
            print 'File does not exist, please try again'
    else:
        sys.exit()

I know, a lot of new code. But if you take a closer look, there is only three lines we have never seen: try except and the last line with sys.exit(). Here, the script tries to open the file provided as input, if it does find the file normal operation resumes, if does not, the script asks for another input. As in the other while loop, we control it with a boolean variable, and in the case of empty input we end the loop and the script, using a system command exit, in the last line of the new code.

Our script is better now, not bulletproof, but a little bit more efficient. There still a “flaw”, that you can only check one file for each run of the script. At least we not stuck to our usual DNA sequence.

Download the commented code on the repository.

Feb 14

As you might have noticed, BPB generally uses protein sequences. I am a “DNA guy”, and basically in our simple examples either type of sequence (except the example on transcribing) could have been used. I will stick with this molecule for a while, or until I can.

We are going to use our good old AY162388.seq file, still assigning the file name inside the script there will be a twist in the end. Also, remember the Regular Expression module? We are going to use it too. I am going to do just as the book: I will paste the code below with comments in the middle and explain it afterwards.

#!/usr/bin/env python
#finding motifs in DNA sequences
# we use the RegEx module
import re
#still keep the file fixed
dnafile = "AY162388.seq"
#opening the file, reading the sequence and storing in a list
seqlist = open(dnafile, 'r').readlines()
#let's join the the lines in a temporary string
temp = ''.join(seqlist)
#assigning our sequence, with no carriage returns to our
#final variable/object
sequence = temp.replace('\n', '')
#we start to deal with user input
#first we use a boolean variable to check for valid input
inputfromuser = True
#while loop: while there is an motif larger than 0
#the loop continues
while inputfromuser:
    #raw_input received the user input as string
    inmotif = raw_input('Enter motif to search: ')
    #now we check for the size of the input
    if len(inmotif) >= 1:
        #we compile a regex with the input given
        motif = re.compile('%s' % inmotif)
        #looking to see if the entered motif is in the sequence
        if re.search(motif, sequence):
            print 'Yep, I found it'
        else:
            print 'Sorry, try another one'
    else:
        print 'Done, thanks for using motif_search'
        inputfromuser = False

Now let’s dissect the code, in a biological way. We already seen everything up to the part the list’s lines are joined. First new lines for us is this

sequence = temp.replace('\n', '')

Most of the methods in Python have very intuitive names (ok, most languages do), so it is easy to deduce that replace actually replaces something. And why do we need to use this method? First, we joined the lines in one temporary string (yep, strings are immutable), but the lines come with everything, including carriage returns that we need to get rid of. So our “final” string sequence receives the value in temp and we apply the method replace to modify it. replace has two arguments, the first is the string we want to change from, while the second is the string we want to replace with (in fact replace accepts three arguments, the last being the number of times we want to do the change). In our case, we are telling the interpreter to get every substring '\n' and put an empty one in their places.

The next line is a simple value assignment:

inputfromuser = True

and the variable will manage the while that checks input from the user. Basically we will run the loop until a certain type of input is given, that will make the variable value become False. That’s why we have the line

while inputfromuser

with no check. Yep, notice that we don’t need to check for the variable’s value, Python assumes that it is True. To understand better, imagine that inputfromuser is a flag that appears when True and disappears when False. Python can only “see” it when it appears, and when it does disappear it needs to check for it. We will elaborate more later.

Another different line awaits us

inmotif = raw_input('Enter motif to search: ')

raw_input is a function that takes a line input by the user and returns a string. In our case, we assign the value returned by the function to a new string called inmotif. Also, raw_input has one prompt argument which is written to the screen with no trailing line, waiting for the user to input something. This takes us to the while loop

if len(inmotif) >= 1:
        #we compile a regex with the input given
        motif = re.compile(r'%s' % inmotif)
        #looking to see if the entered motif is in the sequence
        if re.search(motif, sequence):
            print 'Yep, I found it'
        else:
            print 'Sorry, try another one'
    else:
        print 'Done, thanks for using motif_search'
        inputfromuser = False

After getting the input from the user, we need to check the size of such input: if it is greater or equal to one, we will search it in our sequence, otherwise we will finish the loop. So if we get a valid input (valid in the sense of size, for now) we will compile the regex and search for it.

There is a difference in regex compilation. In the DNA transcribing we assigned a string to the regex directly, now we have a string coming from a variable/object

motif = re.compile(r'%s' % inmotif)

Notice the difference in the argument that is passed to the compile function. In this case the regex to be compiled is coming from the string entered by the user, and we have to pass it by using Python’s string formatting operator, noted as a %. This formatting operator uses two parameters: a string with formatting characters and the data to be formatted, separated by the percent sign. In our case the formatting character will receive a string, hence the %s (s for string), and the data to be formatted that is the input.

Previously, we used the regex function to replace characters/substrings in a sequence. This time, we are interested to know if the motif entered by the user is in our sequence. So we use the search method instead

if re.search(motif, sequence):

Notice that we do the search in at the same time we are testing for its result. The same case as the checking we do at the while loop. If there is a positive result from the regex search a True flag will be raised and the interpreter will execute the code of the initial branch, not testing for the elif and else

print 'Yep, I found it'
        else:
            print 'Sorry, try another one'

This condition is nested inside another condition, the one that tests for the size of the input entered. So, if our search returns a regex object, we print Yep, I found it, otherwise the user will receive Sorry, try another one. Now back to our upper if, if the user input length is equal to zero (just pressing the Enter key) the interpreter will process the line

print 'Done, thanks for using motif_search'

and then

inputfromuser = False

that will tell the while loop that there is no True variable anymore, ending the loop and consequently our script.

Let’s review the script and its flow:

1) assign a filename to be opened
2) read the file
3) join the lines
4) remove carriage returns
5) ask for user input, while is valid
6) if input length is greater or equal to 1, process it
     6a) compile regex with input
     6b) search for it
          6b1) if it is found print yes
          6b2) if it is not found print no
7) if input length is equal to zero, end while loop,
end string

I know it is a lot of information, take your time. The code can be downloaded from the repository. Next I will change a bit this script using other methods to find the motifs, and making the promised twist in the method that reads the file.

Feb 13

We start following the fifth chapter of BPB. The first item is about flow control and code layout, which are very relevant for our tutorial. We already introduced briefly both aspects in past entries on the site, but it is always good to check. As the book, I will start with flow control.

Flow control is the way your instructions are executed by the program. Most languages have a linear flow control, meaning every line is executed from top to bottom. This linear flow control can be disrupted by two types of statements: looping and branching. Looping statements tell the computer to execute a determined set of commands until certain condition is met. This can be a numeric value (ie from 1 to 100) or the number of items in a list (like our shop list from before). Branching statements are also known as conditional statements, tell the computer to execute/or not determined lines depending on certain conditions.

The for loop was shown before. In its for loop Python iterates over the elements in a list like this

for lines in text:
    <i>do something</i>

Notice that the first line of the loop ends in a colon. This and the word for in the line> tell the interpreter that this a for loop and the indented block below is the code to be executed repeatedly until the last element in the list is reached. How Python knows where the loop ends? Indentation. Many languages use curly braces, parentheses, etc. In Python the loop ends by checking the indentation level of lines (this will help us a lot when discussing code layout).

We’ve already seen one example of loop in Python, for, but Python accepts other types of loop structures, such as while, that uses the same indented properties to execute the commands.

[sourcecode]mycounter = 0
while mycounter == 0:
do something[/sourcecode]

Take a closer look at the while line. See something different? Maybe not, if you are used to programming. In Python equality is tested with a double equal sign (==), while a sole equal sign (=) assigns a value to a variable. You can also test for inequality, greater and less than, with !=, < and > respectively.

Attention: is quite common to generate errors by substituting == by =, so pay attention when coding.

Branching statements are the conditional commands in a computer language, usually governed by if ... then ... else. In Python a branching statement would look like

if value == 1:
    do something
elif value == 2:
    do something else
else:
    do even more

Notice the colon ending each line of the conditions and again the indented code, telling the interpreter where the corresponding code for each condition ends.

We are going to use a lot of conditions and loops, but as you might have noticed Python has some tricks that make us avoid these statements.

About code layout just one word: indentation. No brackets, parentheses, curly braces, etc. Yes, we have seen brackets and parentheses, but not to tell the interpreter where loops and conditions start and end. This is taken care by indentation, making our life easier and the code more beautiful.

Next we will see Python’s ability to find motifs in words, mainly on DNA sequences.