Beginning Python for Bioinformatics

a step-by-step guide to creating Python applications in bioinformatics

Goodbye and so long

After 157 posts, I'm saying goodbye to Blind.Scientist, to Beginning Python for Bioinformatics and to the main genedrift.org page. I'm renovating everything, and soon this space will be something else.

Thanks for reading, thanks for the support and, overall, thanks for not reading.

So long!

Rebuilding

The blog engine (WP) was hacked last week and I'm still rebuilding some of its features. Sorry for any inconvenience.

Why I left Biostar, but I still like Stackoverflow

About eight months ago I started using Biostar, as I saw it as a great opportunity to exchange ideas, concepts and tips in biology and bioinformatics. I even mentioned the website in this space, trying to bring more people into the mix; at the time the community wasn't big enough, and some days went by without any question being posted.

But a couple of months ago my interest started to go down the drain. I don't know if it was the constant barrage of next-generation sequencing questions every day, the infantile blog/Twitter posts from members competing for points, or maybe the lack of votes for some answers that I posted (that's selfish on my part, I admit). But at some point it seemed that the website had turned into a competition of CVs or knowledge, very different from what I could see in the various Stackoverflow spin-offs or on the main site. I guess the turning point, the moment I realized that the scientific community (at least in bioinformatics and related fields) will never be the same as the programming and statistical ones, was the time I gave an answer that got fewer votes than the one saying "it's not possible".

Maybe the problem is me: I don't like cliques, I don't mind helping people for nothing, and I don't care about reputation. I didn't care about how many points I had, and I used the down-vote to actually vote down answers that I didn't find pertinent (if you have never used those sites, every down-vote removes one point from your score). I still think that Biostar is a great idea, and I hope it becomes a great resource for all the bio fields. Maybe if the community gets big enough, and if you stop seeing the same group of people that you see everywhere else, it might become a better place to hang out online. But right now, I'm over it.

Preview of Django 1.1 Testing and Debugging

Packt Publishing invited me to review Django 1.1 Testing and Debugging by Karen M. Tracey. They also kindly provided a free chapter that you can download from the link below. A full review will be posted as soon as I finish the book.

preview chapter – Chapter No.3 “Testing 1, 2, 3: Basic Unit Testing”

Initial impressions about Bioinformatics Programming using Python

Last week I placed a five-book order at Amazon, and one of the books was Bioinformatics Programming Using Python: Practical Programming for Biological Data (Animal Guide) by Mitchell L Model.

I started reading the book late Friday night, and I'm now on the third chapter, which introduces sequences. So far, I have found the book very confusing, especially since it claims to be a book for people with no programming background. The examples are OK, but there's a very messy mixture of Python interpreter and standalone script usage, as the author jumps back and forth between them. Another problem is that some examples are explained in detail, including line numbers, while for others you depend on the code's docstring to understand them.

So far, I'm not impressed. The initial Python sequence example is a set, and this chapter already brings in some functional programming concepts, which can be quite challenging for someone who has never programmed in their life. In the second chapter the reader already meets a ternary operator. Another criticism is that in the preface the author suggests using Python 3 instead of 2, which might add to the beginner's frustration when a module cannot be installed.

I will continue reading it and post whenever I have a more complete overview of the book.

Python for Bioinformatics by Sebastian Bassi: a (short) review

I promised some time ago to post a complete review of Python for Bioinformatics (Chapman & Hall/CRC Mathematical & Computational Biology) by Sebastian Bassi. It's long overdue, but the delay allowed me to get better acquainted with the book and its contents.

I can only say that I highly recommend this book, especially for the biologist who is beginning in bioinformatics or Python (or both). I cannot compare it to any other Python and bioinformatics book (I'm planning to buy another one), but I can say that I learned a thing or two from Sebastian's book. Evidently it is not a perfect book, as some of the explanations are a little rushed and might be difficult for a beginner. At the same time, it is a very carefully thought-out and planned book, with more than enough material for one to learn Python and apply the language to solving biological problems. I really liked the BioPython section, which made me use BioPython for the first time. Some of the BioPython examples in the book are light years ahead of the examples on the tool's website.

Lastly, I would like to congratulate Sebastian on his work and effort in putting together a nice tome on Python and bioinformatics. It's a valuable resource for everyone in the field and will certainly help spread Python in the community.

Biostar: bioinformatics community

Biostar is a bioinformatics community on the StackExchange network. It's still small, and not a lot of questions are asked and answered every day, so we need more people participating. If you are new to bioinformatics, or just curious about the newest trends in the field, help us grow.

The real value of blogging

A couple of days ago I posted here an entry called 'The "sickest" Python code I've ever created'. It's code that does some file management for proteomics data, with a different set of inputs each time you run it.

The "sickest" part of the title is that it was a small challenge for me. I've been away from actual hard-core coding for quite some time, and you lose some of the gist of the thing over time. Nowadays I mostly write simple scripts that don't require any kind of advanced skills (in any language), and I don't worry that much about releasing code or about ultra-fast performance. I knew from the moment I posted it that a lot of people would jump in to help and teach me, as I was aware it wasn't the most elegant code out there, nor the most Pythonic. What also helped is that my Python/bioinformatics blog is indexed on Planet Python, so the audience is far more hard-core Python than I could ever have reached by myself.

But the real point is that I believe it would have been much more difficult for me to get positive feedback, or even an answer, if I had posted bits of my code on an online forum, community or mailing list. Every time I used one of those channels, I either got no answer, got schooled for not posting in the right format, or somebody replied that no one knew how to do it. That's the real value of blogging, and the value is even higher if your audience knows more than you do. I appreciate every comment I got on that post and on others too; I learned things that I wasn't able to learn from computer books and online tutorials (yes, I sometimes searched before reading the comments, and sometimes after).

The "sickest" Python code I've ever created

But, I guess, it can easily be refactored/enhanced/despised by the audience that reads or has access to this blog via Planet Python. Anyway, for someone like me, whose main task now is not to generate tons of code and lines, I think the code (or the part of it) that I will present below is quite good. Feel free to comment, criticize and say bad and good things about it.

We needed a script that would take files coming out of protein search engines and compare the peptide and protein sequences, their abundance and some other characteristics. We had a combination of protein and peptide files, with a list of proteins (one protein per line in a tab-delimited file) that was related to a list of peptides in another file (one peptide per line, with multiple peptides/lines related to one protein in the original list). Each line in both files had more than 50 columns, and 8 or 16 of them held the values we wanted to extract. I say 8 or 16 because we didn't know how many would be output each time, as it would depend on the number of samples per run (4 to 8 samples).

So we had a few issues: we didn't know how many proteins would be output (actually found) in each file, we didn't know how many peptides would be found for each protein, and we didn't know beforehand how many samples would be run at once. One good thing was that the 8-16 value columns were fixed, always in the same position, with empty cells if no value was registered there. And we had a fourth problem: usually the sample assignments would be random, meaning a control could come in the first value column or in the last. And a fifth: we didn't know beforehand (the tech knew) how many treatments would be run each time. A treatment could be a different experimental condition, a sample grouping or some other extraneous factor. An extra issue was that we would need to compare multiple files, get protein and peptide abundances from all of them at the same time, and finally compare each treatment.

Basically, in order to create a universal script we needed something flexible enough to handle whatever the experiments threw at us. As a first step we decided to use a YAML file that could be filled in by the experimental researchers with sample assignments, treatments, etc. The YAML would look like this:

B0:
- 114: A
- 115: D
- 116: B
- 117: C

B1:
- 114: C
- 115: A
- 116: D
- 117: B

In this file B0 and B1 would be the result file names, 114 the column/channel where the sample was run, and A, B, C and D the treatments. With this set up, our objective was to get all proteins and their peptides for treatment A in files B0 and B1, do some calculations, and then compare them to all proteins and peptides from treatments B, C and D, also extracted from files B0 and B1.

The first step was to get the names of the treatments from the YAML file:

def get_treatments(mapping):
    treats = set([])
    for entry in mapping:
        [treats.add(list(t.values())[0]) for t in mapping[entry]]

    return treats

where mapping is the data structure loaded from the YAML file. We used a set to store the treatments because sets only keep unique items, and treatment names can vary from file to file. In the code above we basically walk through the loaded YAML data and collect the value of each entry.
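To make the function concrete, here is a runnable sketch. The mapping dict below reproduces by hand what yaml.safe_load would return for the file above, so the snippet runs without PyYAML installed:

```python
# Hand-built equivalent of yaml.safe_load() on the YAML file above:
# each result file maps to a list of one-item {channel: treatment} dicts.
mapping = {
    'B0': [{114: 'A'}, {115: 'D'}, {116: 'B'}, {117: 'C'}],
    'B1': [{114: 'C'}, {115: 'A'}, {116: 'D'}, {117: 'B'}],
}

def get_treatments(mapping):
    treats = set([])
    for entry in mapping:
        # each t is a single-pair dict, e.g. {114: 'A'}; keep its value
        [treats.add(list(t.values())[0]) for t in mapping[entry]]
    return treats

print(sorted(get_treatments(mapping)))  # ['A', 'B', 'C', 'D']
```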

We then needed a class to store protein information, and that's where the story got hairy. With all my (lack of) experience, I decided to use exec statements to work around all the uncertainty in the experimental data details. I didn't have the treatment names beforehand (or in a fixed immutable list), I didn't have the columns (channels) that were being used at the time, and I had to correctly assign each protein abundance (area) to its place. In the end our class looked like this:

class Protein():
    """Class Protein, stores all the information about channels and areas, name and accession"""
    def __init__(self, accession, name, treatments):
        self.accession = accession
        self.name = name

        #ratio channels are called based on their name
        for i in treatments:
            exec('self.%s = []' %i)
            exec('self.area%s = []' %i)

    def add_to_channel(self, channel, peptide):
        exec('self.%s.append(peptide)' % channel)

    def add_to_area(self, channel, area):
        exec('self.area%s.append(area)' % channel)

To be faithful to this blog's name, I will explain how the code above is supposed to work. First, exec is a Python statement (a built-in function in Python 3) that supports dynamic execution of code. In our case it was used to name the objects, so we would be able to access them by name in subsequent functions. Let's take this for example:

for i in treatments:
    exec('self.%s = []' %i)
    exec('self.area%s = []' %i)

In this snippet we were trying to create lists called (for the YAML file above) A, B, C and D, and another set of lists called areaA, areaB, areaC and areaD. For another experiment we might instead have treatments "Control", "Low" and "High", and so on.
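A cut-down, runnable version of the constructor (the accession and protein name below are made up) confirms that the exec calls really do create one list attribute per treatment:

```python
class Protein:
    """Minimal version of the class above: just the exec-driven attributes."""
    def __init__(self, accession, name, treatments):
        self.accession = accession
        self.name = name
        for i in treatments:
            exec('self.%s = []' % i)      # creates e.g. self.A = []
            exec('self.area%s = []' % i)  # creates e.g. self.areaA = []

# hypothetical accession and name, treatments from the YAML above
p = Protein('P12345', 'made-up protein', ['A', 'B', 'C', 'D'])
print(p.A, p.areaD)  # [] []
```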

The next two methods use the same approach with exec, this time appending to the freshly created lists. This way it's easy to handle whatever the user throws at us.

I don't know if this is the best approach possible, or whether or not it is harmful. Experts reading this might have better ideas, and I would appreciate them. We'll check the rest of the script next time.
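One commonly suggested alternative, offered here as a sketch rather than as the code the script actually used, is to create the same dynamic attributes with setattr and getattr instead of exec:

```python
class Protein:
    """Same behavior as the exec version, using setattr/getattr instead."""
    def __init__(self, accession, name, treatments):
        self.accession = accession
        self.name = name
        for t in treatments:
            setattr(self, t, [])           # e.g. self.A = []
            setattr(self, 'area' + t, [])  # e.g. self.areaA = []

    def add_to_channel(self, channel, peptide):
        getattr(self, channel).append(peptide)

    def add_to_area(self, channel, area):
        getattr(self, 'area' + channel).append(area)

# hypothetical accession, name, peptide and area values
p = Protein('Q67890', 'example protein', ['A', 'B'])
p.add_to_channel('A', 'PEPTIDEK')
p.add_to_area('A', 1.5)
print(p.A, p.areaA)  # ['PEPTIDEK'] [1.5]
```

The advantage is that setattr and getattr take plain strings, so there is no string-to-code step and no risk of a treatment name being executed as arbitrary Python.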

Python Testing Beginner’s Guide, review

I posted about a week ago that Packt Publishing had invited me to review Python Testing Beginner's Guide by Daniel Arbuckle. Having finished reading it (I must admit that I haven't tried all the code in it), I can say that I have an excellent initial impression of the book.

PTBG is not a long book; the material is divided into 10 chapters and one appendix. One of the first things that I liked about the book is that there's no introduction (or anything similar) to Python. It goes straight to the point, assuming that you have a good understanding of the language and everything that surrounds it. In the past I was frustrated with some "Introduction to X with Python" books that wasted precious space talking over and over about a topic, learning Python, that is better covered in many other books. PTBG wastes no time or space and goes straight to its main topic, which is testing; in my opinion that's the best approach, even though it might seem a little abrupt to some.

The language in the book is clear and very pleasant; PTBG is a very well written book and I really enjoyed its style. The first chapters cover Python testing using doctests. For someone like me who didn't write many tests in the normal software development workflow (I know I should write more), this section is a really nice introduction to the topic, with well-thought-out, real-life-like examples and a good flow in the explanation of the different features. One small complaint is that for a beginner the code listed in the examples might sometimes seem a little confusing, and the addition of line numbers might have helped here. But at the same time I understand that this is the normal style of some Packt books.
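For readers who have never seen a doctest, this is the general style those early chapters teach: the interpreter examples in the docstring double as tests. The gc_content function is my own toy example, not one from the book:

```python
def gc_content(seq):
    """Return the GC fraction of a DNA sequence.

    >>> gc_content('ATGC')
    0.5
    >>> gc_content('GGCC')
    1.0
    """
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / float(len(seq))

if __name__ == '__main__':
    import doctest
    doctest.testmod()  # silent when all docstring examples pass
```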

After the doctest section, PTBG gets into more advanced techniques, covering mock objects a bit with Mocker, then moving on to unittest and nose. The latter is a Python tool for managing, running and automating tests. Also covered is Twill, another third-party library, which allows for testing of web applications.

One full chapter is devoted to test-driven development, with a complete walkthrough of the approach. This wraps up most of the techniques and modules covered in the book, but there's still room for another chapter showing how beautifully doctests, unittest and nose can be integrated to help the development of applications using the test-driven approach.

Overall, I really enjoyed PTBG. As I mentioned, test-driven development was never a high priority in the applications I usually developed with Python, but this book can certainly be a good starting point for Python testing beginners to incorporate these techniques into their usual development workflow. Scientific software is also a perfect niche for this type of approach, and we should do what we can to avoid the nightmares of the past.