Reading FASTA files: introduction
Section 5 July 3rd, 2007Again after a long period we are back. We already have most of the knowledge to create very useful scripts and small programs in Python. And in this post we will create a routine that will make the biologist’s life even easier (when programming Python).
A great part of bioinformatics is to store data, how to store it and which format to use. Anyone working in a wetlab or on a pure bioinformatics lab had had problems with file formats one day or another. In fact, you don’t have to be in a bioinformatics environment to have such problems, but in biology people tend to create a file format for everyt program they write. We can call it a lack of standards, sometimes. But one very well established format is the FASTA (pronounced FAST-Aye, according to the EMBl-EBI page for the software with identical name. This format is really simple and easy to manipulate with most computer languages and being text-based adds an extra advantage of portability for the files. Usually a FASTA file has a structure like this
>title/name/extra information about the sequence
sequence in one or many lines
There is no limit on the number of sequences that can be stored in a file, neither on the size of each sequence. Usually sequences that are larger than 70-80 nucleotides/amino acids are displayed in multiple lines.
>Q15465|SHH_HUMAN Sonic hedgehog protein - Homo sapiens (Human).
MLLLARCLLLVLVSSLLVCSGLACGPGRGFGKRRHPKKLTPLAYKQFIPNVAEKTLGASG
RYEGKISRNSERFKELTPNYNPDIIFKDEENTGADRLMTQRCKDKLNALAISVMNQWPGV
KLRVTEGWDEDGHHSEESLHYEGRAVDITTSDRDRSKYGMLARLAVEAGFDWVYYESKAH
IHCSVKAENSVAAKSGGCFPGSATVHLEQGGTKLVKDLSPGDRVLAADDQGRLLYSDFLT
FLDRDDGAKKVFYVIETREPRERLLLTAAHLLFVAPHNDSATGEPEASSGSGPPSGGALG
PRALFASRVRPGQRVYVVAERDGDRRLLPAAVHSVTLSEEAAGAYAPLTAQGTILINRVL
ASCYAVIEEHSWAHRAFAPFRLAHALLAALAPARTDRGGDSGGGDRGGGGGRVALTAPGA
ADAPGAGATAGIHWYSQLLYQIGTWLLDSEALHPLGMAVKSS
So, let’s write a function and a small script to read a typical FASTA file and display the output. Later we will see how to manipulate the file and in the end we will have created a list of scripts that will be useful for the everyday laboratory life.
But before we get to our final goal, we need to learn some new features of Python. In our case Classes. Classes in Python are very similar to classes in C++, and they are the building blocks of Object Oriented Programming (OOP). This guide will only scratch the surface of OOP, as it is a very complex subject, requiring a guide of its own. There are plenty of good introductory material online and any Google search will return dozens of links.
Let’s focus on some basic concepts, what will be exactly what we will need here. Classes are basically objects with associated properties (attributes) and methods. A traditional introductory exampple of classes is the employee list. In a company all employees are registered and they basic information are stored in the human resources file system. Let’s call the main class Employee with three attributes: name, room_number and favourite_colour. This defines a class. Let’s see how to do that in Python:
class Employee:
def __init__(self, name, room, colour):
self.name = name
self.room = room
self.colour = colour
Let’s dissect this piece of code. First we declare a class and give a name to it Employee. The next thing we need to do is to define the initiation (constructor) method of the class and give it the attributes we need, that’s the __init__ method. Everytime we create a copy of the main class object the copy will be initialized with the values we are passing to the method. According to Dive into Python “The first argument of every class method, including __init__, is always a reference to the current instance of the class.”. That’s the self on the method definition, which is followed by what we want to store in the object: name, room and favourite colour. The lines in the method are the ones assigning the received values to the different class’ attributes. Again from Dive into Python we have “To reference this attribute from code outside the class, you qualify it with the instance name, instance.data, in the same way that you qualify a function with its module name. To reference a data attribute from within the class, you use self as the qualifier.”. In other words the attributes of the class are seen internally by it by the use of self while from the outside (another part of the script) it is seen by the name given to the instance of the class. It is a little bit confusing, so let’s see an example below.
class Employee:
def __init__(self, name, room, colour):
self.name = name
self.room = room
self.colour = colour
employeename = "Paulo"
roomnumber = "21"
colour = "blue"
newemployee = Employee(employeename, roomnumber, colour)
print newemployee
If we run this simple code we will get something like this
<__main__.Employee instance at 0x7379cc>
And that’s exactly what we need here: that’s a class instance that was created by
newemployee = Employee(employeename, roomnumber, colour)
Printing this information is not very useful. We don’t want to know the memory address of the object, we want to get the attributes of it, so we need to change print newemployee by these lines:
print newemployee.name print newemployee.room print newemployee.colour
There they are, run it and see what is being printed to the screen. Of course, as is this script does not accomplish antyhing useful, but imagine that you need to read a file with 1000 new employees everyt month. That’s what we need to do with FASTA files everyday and we will see how to do that on the next post.
July 4th, 2007 at 11:00 am
Hey are you trying to replicate what http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html guys are doing? Reading this post, it reminds me of their http://www.pasteur.fr/recherche/unites/sis/formation/python/ch11s03.html#d0e3774 . I feel if one is serious about using Python [ http://www.python.org/ ] in Bioinformatics, one should check out Biopython [ http://www.biopython.org/ ] as early as possible. Or there is something interesting coming up here later and this is just the teaser to the big event
July 4th, 2007 at 12:50 pm
Hi Animesh
I am not trying to replicate what the Pasteur people is (was) doing. This is just another tutorial on how to write simple Python scripts for Bioinformatics use. It is like the many different Python programming books available: you choose what fits you best.
Clearly, BioPython is one way to go to create Python scripts for bioinfo, but sometimes (1) one wants to know how to do things her/himself and (2) BioPython is not available. I recommend BioPython to whoever wants to get serious about Python and bioinfo.
I don’t know if there is something interesting coming up. It might not be for you but it might be very interesting to someone else. Keep checking and commenting and if you want to see something else here, just let me know.
Paulo
July 5th, 2007 at 10:28 am
Yes I agree with you, one should have choice
I have shamelessly (without asking you) suggested your link as a good alternative to the pasteur thing I clipped for my blog.
I have seen lot of things where people have used Python for Structure related jobs and some places where they have used it for Sequence related job as well, but one thing I have never seen is Python being used for Gene Expression and Evolutionary Biology related work… given the strong numerical (with NumPY) character of Python, better support with OOPs (can actually make good software instead of throwaway scripts I feel) and a flavor of functional programming I am sure it will be a good idea to use these as well as an example.
July 5th, 2007 at 11:34 am
Hi Animesh
I understood your point of view and thanks for linking to this site.
I think that evolutionary biology software are more C/C++ based to take advantage of the speed of computation. Python can be slow sometimes and I don’t see it very well geared to calculate large phylogenetic trees, for instance.
One drawback of writing large Python software for distribution is actually how to distribute it and how to distribute portable solutions. I have some (little) experience writing wxPython interfaces and although Windows and Linux executables are not very large (~8 Mb), OSX standalone packages are huge, around 70 Mb in size for a 500-line program.
Regarding the functional programming, I am planning to add some entries regarding it with some (”funny”) examples for bioinformatics stuff.
Thanks a lot for your comments and please keep commenting and visiting the website. Every opinion is important.
Paulo
July 4th, 2008 at 12:33 am
how to create a module to read a fasta sequenec in Biopython .send the code for this