“Manipulating” Python lists II

Section 1 5 Comments »

As mentioned we will see in this entry some other features of Python lists. We will start with a similar example to the one in the book and then use our DNA file. So let’s assume we have this simple list

nucleotides =  [ 'A', 'C', 'G', 'T']

If we print it directly we would get something like this

['A', 'C', 'G', 'T']

which is fine for now, as we are not worried (yet) with the output (what we will do further below). Let’s remove the last nucleotide. To accomplish that, we use pop with no specific index

nucleotides.pop()

which gives me this when printed

['A', 'C', 'G']

Remember that lists are mutable, so the removed item is lost. We can also remove any other in the list, let’s say ‘C’. First, we reassign the original list items and then remove the second item

nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.pop(1)

The list when printed will return this

['A', 'G', 'T']

pop accepts any valid index of the list. Any index larger that the length of the list will return an error. For future reference, remember that when any item is removed (and inserted) the indexes change and the length also. It may seems obvious but mistakes are common.

Shifting from our ‘destructive’ mode, we cal also add elements to the list. Adding to the end of the list is trivial, by using append

nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.append('A')

that returns

nucleotides = [ 'A', 'C', 'G'. 'T', 'A']

Adding to any position is also very straightforward with insert, like this

nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.insert(0, 'A')

where insert takes two arguments: first is the index of the element before which to insert and second the element to be inserted. So our line above will insert an ‘A’ just before the ‘A’ at position zero. We can try this

nucleotides = [ 'A', 'C', 'G'. 'T']
nucleotides.insert(0, 'A1')
nucleotides.insert(2, 'C1')
nucleotides.insert(4, 'G1')
nucleotides.insert(6, 'T1')

that will result in

['A1', 'A', 'C1', 'C', 'T1', 'T', 'G1', 'G']

Notice that we add every new item at an even position, due to the fact that for every insertion the list’s length and indexes change.

And for last, we will take care of the output. Of course if are creating a script that requires a nicer output, printing a list is not the best way. We could create a loop and merge all entries in the list, but that would be a couple of lines and we ought to have an easier way (otherwise we could be using C++ instead). There is a way, by using the method join. This method will join all the elements in a list into a single string, with a selected delimiter.

nucleotides = [ 'A', 'C', 'G'. 'T']
"".join(nucleotides)

will generate this output

ACGT

join is a method that applies to strings. The first “item” is a string, that could be anything (in our case is an empty one). The code line tells Python to get the empty string an join it to the list of strings that we call nucleotides.

With this we finish the first section of the site and we are moving to chapter 5 in the book.

“Manipulating” Python lists I

Section 1 Comments Off

Now, I want to manipulate my DNA sequence, extract some nucleotides, check lines, etc. Simple things. We start with the same basic code to read the file:

#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()

Our nucleotides are stored in the variable file. Remember that each line is one item of the list and the lines still contain the carriage return present in the ASCII file. Let’s get the first and the last lines of the sequence. The first line is easy to get, as Python’s lists start at 0. To access one list item just add square brackets with the index number of the item you want to get (this is also known as slicing). Something like this

file[0]

will return the item 0 from the list, that in our case is the firs line of the sequence. If you add a print command

print file[0]

you should expect

GTGACTTTGTTCAACGGCC....CGTAATCACTTGTTC

The last line is a little bit trickier. Let’s assume that we don’t know the number of lines in the list, and here we want to make our script as general as possible, so it can handle some simple files later. It is also good code practice to think ahead and plan what you want, first to have a detailed project to follow, and second it allows you to be prepared to errors/bugs that your code might have or situations not expected in your original plan.

In our file, we have eight lines of DNA, so it would be just adding this print file[7] and we would output the last line. But, the right way to do it is to check the length of the list and output the item which has an index equal to the list length. In Python, you can check the length of a list by adding the built-in function len before the list name, like this

len(file)

So who do we print the last line of our sequence? Simple. len(file) should return an integer of value 8, which is the actual number of elements in our list. We already know that to access any item in a list we just add its index (that has to be an integer) to the list name. One idea then would be to use len(file) as the index, like this

print file[len(file)]

Why would that be wrong? Our list has eight items, but the indexes are from 0 to 7. So eight would be one index over the list length, which is not accessible because it does not exist. Solution? Let’s use the list length minus one:

print file[len(file)-1]

and there you are, the last line of the sequence. But as we want the last line of the file, which is a list there is an easier way to output just the last line:

print file[-1]

which tells the interpreter to print the last item of a list.

Putting everything together, we have

#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()
print 'I want the first line'
print file[0]
print 'now the last line'
print file[-1]

Two points worth mentioning: differently of strings, Python’s lists are mutable, items can be removed, deleted, changed, and strings also can be sliced by using indexes that access characters.

Next we will see some more features of lists and strings, and how to manipulate them. It will probably be the last entry in the first section as we finish the Chapter 4 in the book. As you may have noticed some items in the perl book will not be covered, at least not immediately. We will jump back and forth sometimes.

Reading files in Python: using lists

Section 1 Comments Off

Let’s improve our previous script and put the contents of the file in a variable similar to an array. Python understands different formats of compound data types, and list is the most versatile. A list in Python can be assigned by a series of elements (or values) separated by a comma and surrounded by square brackets

shoplist = ['milk', 1, 'lettuce', 2, 'coffee', 3]

Now, we are going to read the same file and store the DNA sequence in a list and output this variable. The beginning of the script is the same, where we basically tell Python that the file name is AY162388.seq.

#! /usr/bin/env python
dnafile = "AY162388.seq"

We are going to change the way we read the file. Instead of just opening and then reading line-by-line, we are going to open it a read all the lines at once, by using this

file = open(dnafile, 'r').readlines()

Notice the part in bold? In the previous script, we open and store the contents of the file in a file object. Now , we are opening the file and just after it is opened, we are reading all the lines of the file at once and storing them in filefile object, but is a Python's list of strings.

Before, if we wanted to manipulate our DNA sequence, we would had to read it, and then in the loop store in a variable of our choice. In this script, we do that all at once, and the result is a variable that we can change the way we wanted. The code without the output part is

#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()

Try putting a print statement after the last line to print the file list. You will get something like this

['GTGACTT...TTGTTC\n', 'TTTAAATA....TAATC\n', 'AGTGA...CTATG\n', 'GAGCTCA....TATAGC\n', ...]

which is exactly the description of a Python’s list. You see all lines, separated by comma and surrounded by square brackets. Notice that each line has a carriage return (\n<\code>) symbol at the end.

Let’s make the output a little nicer including a loop. Remember when I introduced loop I wrote that Python iterates over “items in a sequence of items”, what is a good synonym for list. So the loop should be as straightforward as

for line in file:
    print line</pre>
Putting everything together gives us
<pre lang="python">#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r').readlines()
print file
for line in file:
    print line

that can be downloaded here. Next we will work on improving the output again and maybe modify/convert the list.

Reading files in Python

Section 1 Comments Off

We are currently following Chapter 4 of Beginning Perl for Bioinformatics, which is the first chapter of the book that actually has code snippets and real programming. The last exercises in this chapter deal with the ability to read files and operate with information extracted from these files, to create arrays and scalar list in perl.

We are going to check how to read files in python. The book tells you how to read protein sequences. Here we are going to read DNA and protein sequences from files and change them.

Let say you have a file with a DNA sequence in some directory in your hard disk. The file cannot be a FASTA type (we will see later how to handle FASTA files), just pure sequence, something like this:

GTGACTTTGTTCAACGGCCGCGGTATCCTAACCGTGCGAAGGTAGCGTAATCACTTGTTC
TTTAAATAAGGACTAGTATGAATGGCATCACGAGGGCTTTACTGTCTCCTTTTTCTAATC
AGTGAAACTAATCTCCCGTGAAGAAGCGGGAATTAACTTATAAGACGAGAAGACCCTATG
GAGCTTTAAACCAAATAACATTTGCTATTTTACAACATTCAGATATCTAATCTTTATAGC
ACTATGATTACAAGTTTTAGGTTGGGGTGACCGCGGAGTAAAAATTAACCTCCACATTGA
AGGAATTTCTAAGCAAAAAGCTACAACTTTAAGCATCAACAAATTGACACTTATTGACCC
AATATTTTGATCAACGAACCATTACCCTAGGGATAACAGCGCAATCCATTATGAGAGCTA
TTATCGACAAGTGGGCTTACGACCTCGATGTTGGATCAGGG

You can download the file here. This is a partial sequence of a mitochondrial gene from a South American frog species called Hylodes ornatus. For our purposes you can save the file in the same directory you are going to run the script from, or if you are using the Python interpreter start it in the directory that contains the file.

The file name is not important but we will use AY162388.seq from now on. The first thing we have to do is to open the file for reading. We define a string variable that will contain the file name.

dnafile = "AY162388.seq"

In order to open the file, we can use the command open, that receives two strings: the first is the file name (it can be the whole location too) to be opened and the mode to be used, which is what you want to do with the file. The mode can be one or more letters that tell the interpreter what to do. For now we are going to use the r mode , which tells Python to read the file, and only do that. So our code is

file = open(dnafile, 'r')

file is a file object that contains the directives to read our DNA sequence file. Now, we have actually read the contents of the file but they are stored in a file object and we did not accessed it yet. We can achieve that by using a myriad of commands. We will start with the commonest one: read the file line by line.

file is our file object. We have just opened it, but Python already knows that any file contains lines (remember that this is a regular ASCII file). We are going to use a loop to read each line of the file, one by one

[sourcecode language='python'>for line in file:
print line[/sourcecode]

In Python, the for loop/statement iterates over items in a sequence of items, usually a string or a list (we will see Python’s list soon), instead of iterating over a progression of numbers. Basically, our for above will iterate over each line in the file until EOF (end-of-file) is reached. Our simple script to read a DNA sequence from a file and output to the screen is

#! /usr/bin/env python
dnafile = "AY162388.seq"
file = open(dnafile, 'r')
for line in file:
    print line

You can download the above script here. To run it have the AY162388.seq in the same directory.

Transcribing: the “other” way

Section 1 1 Comment »

We have seen how to transcribe DNA using regular expression, even though the regex we used cannot be considered a real one. Now we are going to simplify our small script even more and take advantage of some string capabilities of Python. Instead of using two lines, we are going to use only one. And also we won’t need to import anything.

Let’s start again with the same DNA sequence

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'

This time we are going to use replace. This is one of the Python’s methods to manipulate strings. Basically, we are asking the interpreter to replace a certain string by another. The method returns a new copy of your string:

myRNA = myDNA.replace('T', 'U')

This tells Python: myRNA will receive a copy of myDNA where all Ts were changed by Us. the “dot” after myDNA means that the method replace will get that variable as input on that variable.

So our code from above would like this

#! /usr/bin/env python

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
myRNA = myDNA.replace('T', 'U')
print myRNA

Simple and efficient. Next we will use the same approach on generating the reverse complement of a DNA sequence, with no regex pattern.

The Regular Expression

Section 1 3 Comments »

As mentioned above, regex in Python are provided by the re module, which provides an interface for the regular expression engine. First thing we have to do is to tell the interpreter what to do and what expression to use.

Let’s start with a DNA sequence.

myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'

How to transcribe it to RNA? Transcription creates a single-strand RNA molecule from the double-strand DNA; basically the final result is a similar sequence, with all T’s changed to U’s. So our regular expression has to find all T nucleotides in the above sequence and then replace them.

Regular expressions in Python need to be compiled into a RegexObject, that contains all possible regular expression operations. In our case we need to search and replace, what can be done by using the sub() method. According to the Python’s Regular Expression HOWTO sub() returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement replacement. If the pattern isn’t found, string is returned unchanged.

Let’s put everything above in real code. First we need to compile a new RegexObject that will search for all thymines in our sequence. It can be achieved by using this:

regexp = re.compile('T')

Simple as that. This line of code tells the Python interpreter that our “regular expression” is every T in our string. Now, we have to make replace those Ts with Us. In order to do that we just tell the interpreter:

myRNA = regexp.sub('U', myDNA)

Let’s look at the last two lines of code. On the first line we created a new RegexObject, regexp (that could have any name, as any variable) and compiled it, making our regular expression to be every T in our string. On the second line, we assigned our soon to be created RNA sequence to a new string (remember that strings in Python are immutable) and used the command sub to replace in the Ts by Us present in our original DNA string. Putting all together our transcription code will be

#! /usr/bin/env python

import re
myDNA = 'ACGTTGCAACGTTGCAACGTTGCA'
regexp = re.compile('T')
myRNA = regexp.sub('U', myDNA)
print myRNA

You can download the resulting script here.

Python printing statement

Section 1 6 Comments »

Just an apart from the bioinformatics aspect of programming: Python’s print statement.

As in most computer languages Python allows an easy way to write to the standard output. Python’s print only accepts output of strings, and if the variable sent to it is not a string it is first converted and then output.

The print always put a linebreak ('\n' or "\n") at the end of the expression to be output, except when the print statement ends with a comma. For example:

print "This is a"
print "test"

will print

This is a

test

while

print "This is a",
print "test",

will print

This is a test

Of course Python’s print statement allows any programming escape characte, such as '\n' and '\t'.

Concatenating strings on output

To concatenate two strings on output there are two possible ways in Python. You can either separate the strings with a comma, like we did here

print myDNA, myDNA2

or you can use the “+” sign in roder to obtain almost the same result. This is similar to what was used here

myDNA3 = myDNA + myDNA2

but instead we would use the print command as

print myDNA3 + myDNA

In the latter case, both strings will not be separated by a space and will be merged. To get the same result you would have to concatenate an extra space between the strings like

print myDNA3 + " " + myDNA

Importing and regular expressions

Section 1 Comments Off

Tisdall’s book on Perl introduces next the ability to transcripts DNA sequences into RNA. In order to do that we need to check a different aspect of programming: regular expressions (or regex). Regular expressions is a pattern/string expression that works matching/describing/filtering other strings.

Let’s say you want to examine or extract all vowels contained in one phrase, one page, one word. Another example would be to remove all html tags from a downloaded webpage. As HTML tags are encapsulated between < and > signs we can create a regex that will search for any characters in between the signs and remove (parse) them from our page. We will deal very briefly with regex, and if you are interested in learning more about it you can search for countless references on the internet (such as this one).

In order to use regular expression in Python we need to check another concept present in the language: importing modules. Python core functionality provides most of the usual operations and also comes with a built-in library of functions that "access to operations that are not part of the core of the language but are nevertheless built in". One of this operations is the ability to interpret regular expression that in Python is located in the re module. Apart from the language core, built in modules, Python can be further extended by using third-party modules imported into the language. Anyone can create a module and distribute to every Python user and programmer.

So, in order to have regex capabilities we literally have to import the regex module. We do that by entering the line:

import re

Python's code style guide suggests that import statements should be on separate lines

import sys
import re
 ...

and to be put on the top of the file (usually below the line that tells your OS to use Python's interpreter on the script).

So the first two lines of our new script would be

#! /usr/bin/env python
import re

Now that we have the ability to use regex, we need to create one expression that will transcribe our DNA sequences into RNA.

Sequences and Strings – part II

Section 1 4 Comments »

Another important task for many biologists is to merge/concatenate different strings of DNA in one unique sequence. We can modify the previous script to concatenate two distinct DNA sequences in one.

We start using code_01 structure, adding some extra elements (line 3):

#! /usr/bin/env python
myDNA = "ACGTACGTACGTACGTACGTACGT"
myDNA2 = "TCGATCGATCGATCGATCGA"
print myDNA, myDNA2

So far, we added a new string containing an extra DNA sequence and we print both sequences. In Python print statement automatically adds a new line at the end of the string to be printed, unless you add a comma (,) to the end. The comma is also needed if you are going to print more than one string in order to separate them (try removing the comma from the code above).

Now, how do we merge myDNA and myDNA2? Easy in Python: just sum them with a plus signal:

myDNA3 = myDNA + myDNA2
print myDNA3

Notice that in Python strings are immutable, meaning they cannot be changed. This immutability confers some advantages to the code where strings (in Python strings are not variables) cannot be modified anywhere in the program and also allowing some performance gain in the interpreter. So, in order to have our sequences merged we create a third sequence that receives both strings. Finally our code will be (some captions were added):

#! /usr/bin/env python
myDNA = "ACGTACGTACGTACGTACGTACGT"
myDNA2 = "TCGATCGATCGATCGATCGA"
print "First and Second sequences"
print myDNA, myDNA2
myDNA3 = myDNA + myDNA2
print "Concatenated sequence"
print myDNA3

Easy, eh? Of course these two simple scripts do no scratch the surface of Python programming, but they are a start.

The above code can be downloaded from the repository.

Hands on code: Sequences and strings – part I

Section 1 2 Comments »

As pointed in Beginning Perl for Bioinformatics, a large percentage of bioinformatics procedures deals with strings, especially DNA and amino acids sequence data. As is largely known DNA is composed of four different nucleotides: A, C, T and G and proteins can contain up to 20 amino acids. Each one of these elements have one letter of the alphabet assigned to them. In the DNA case some letters represent one or more nucleotides that can be present at some sequence position (click here for more ).

So, as the amino acid is the basic building block of proteins (AKA polypeptides), strings containing sequence is our most basic block, from where all the bioinformatics magic will work on.

Usually in Perl a string is represented by the dollar sign in front of the variable name, like this $sequence. Python is dynamically typed, meaning variable types are assigned/discovered by the interpreter at run time. This means that the value after the equal sign will tell the interpreter what variable type you are declaring. So in Python if you want to store a DNA sequence you can just enter

mydna="ACGTACGTACGTACGTACGTACGT"

a quick note: Python can be used with the interpreter command line or by previously saved scripts. I will try to use the latter in the code examples.

OK, we are ready to create our first Bioinformatics Python Hello World script. Let’s get the sequence above and print it on the screen. The first line will tell the operating system what to use and where to find the Python interpreter

#! /usr/bin/env python

Next we will create the variable myDNA and assign the corresponding sequence

myDNA = "ACGTACGTACGTACGTACGTACGT"

And finally, we will print the contents of the variable to the screen:

print myDNA

As mentioned above, Python mandates that you have your code indented, but in our final script this is not needed:

#! /usr/bin/env python
myDNA = "ACGTACGTACGTACGTACGTACGT"
print myDNA

The first line tells your operating system that this is a Python script and to use the interpreter located in that directory; line two declares a variable called myDNA and assigns the sequence string to it and the last line simply output this variable to the screen. That simple!

To run this (extremely simple) script you can copy and paste the code above to your favourite text editor save the file with a .py extension (recommended but not necessary). To run the script, as long as you have Python installed, just open a shell and type on the command line:

> python code_01.py

To download the above script check the repository.