Mar 26

With the last post we finished Section 3 of the book. As always, some subjects were not covered, either because Python handles them differently than Perl or because they are not suitable for an introduction to Python.

From now on, the book presents much longer scripts. I will try to follow them closely and include modifications and updates where I find them appropriate.

Regarding comments, I have been receiving a lot of spam on the Blind.Scientist blog, so I kept comments closed here until I learned a little more about WordPress administration. Please feel free to post your comments and criticism, and to point out errors and mistakes.

Thanks.

Mar 26

We have seen, briefly, how to define and use a function in Python. Now we are going to jump forward a bit and create a new function, and at the same time take a look at the command line parameters that can be passed to a script.

If you have used command line applications before, you might have encountered programs that ask for a file name, a calculation parameter, etc. to be passed on the command line. Python scripts are no different: they accept such parameters too. For this we have the sys module, which provides system-specific parameters and functions. We have already used sys.exit from this module. Every operating system (even Windows) supports arguments on its command line, and programming languages usually call such arguments argv (in C/C++, argv is one of the parameters of the main function). In Python, sys.argv is a list. Lists in Python start at 0 (zero), and the first item of the argument list is the script/program name. Basically, if we have this

$> python myscript.py DNA.txt

myscript.py is argument 0 in the list and DNA.txt is argument 1. So whenever we create a script that receives arguments on the command line, we have to start reading them from index 1 (in most cases, so be aware). In Python, using command line arguments looks like

import sys

filename = sys.argv[1]
valueone = sys.argv[2]
...
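Indexing sys.argv directly raises an IndexError when the argument is missing. A minimal sketch of a more careful version, assuming a hypothetical helper called get_filename (the name is mine, not from the book), could look like this:

```python
import sys

def get_filename(argv):
    """Return the first command line argument, or None if it is missing.

    get_filename is a hypothetical helper used only for illustration.
    """
    # argv[0] is the script name, so a usable argument needs len(argv) >= 2
    if len(argv) < 2:
        return None
    return argv[1]

if __name__ == '__main__':
    filename = get_filename(sys.argv)
    if filename is None:
        print("Usage: python myscript.py <sequence file>")
    else:
        print("Sequence file: " + filename)
```

Passing the list as a parameter, instead of reading sys.argv inside the function, makes it easy to try the function in the interpreter with a hand-made list.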

We will write a variation of our previous script that counts the bases, now with command line arguments and a function (with no “error” checking at first)

#!/usr/bin/env python

import sys

def count_nucleotide_types(seq):
    result = []
    totalA = seq.count('A')
    totalC = seq.count('C')
    totalG = seq.count('G')
    totalT = seq.count('T')

    result.append(totalA)
    result.append(totalC)
    result.append(totalG)
    result.append(totalT)

    return result

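# Read all lines of the file named in the first command line argument
sequencefile = open(sys.argv[1], 'r').readlines()
sequence = ''.join(sequencefile)
# Remove the newline characters before counting
sequence = sequence.replace('\n', '')
values = count_nucleotide_types(sequence)
# The returned list holds the counts in the order A, C, G, T
print("Found " + str(values[0]) + " As")
print("Found " + str(values[1]) + " Cs")
print("Found " + str(values[2]) + " Gs")
print("Found " + str(values[3]) + " Ts")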

A few new things here. We created a function, count_nucleotide_types, that receives a string containing the sequence. The “real” first line of the program flow is the one that gets the name of the file from the command line argument, then opens and reads it. We then convert the list of lines to a string, modify it a bit by removing the newline characters, and pass it to the function. We get the result back, print the counts, and we are done.
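The step that turns the list of lines into a single string can be tried on its own, without a file. The sketch below uses a hand-made list as a stand-in for what readlines() returns; the function name lines_to_sequence is a hypothetical one chosen for this example.

```python
def lines_to_sequence(lines):
    """Join the lines of a file into one string without newline characters.

    lines_to_sequence is a hypothetical name chosen for this sketch.
    """
    return ''.join(lines).replace('\n', '')

# Stand-in for what readlines() returns for a two-line sequence file
example_lines = ['ACGT\n', 'TTGA\n']
sequence = lines_to_sequence(example_lines)
print(sequence)              # ACGTTTGA
print(sequence.count('T'))   # 3
```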

With functions we don’t actually save coding time or length (at least here), but we make our code more organized, easier to read, and easier for someone else to understand. It is not good coding practice to have long programs/scripts with no functions, no subdivision, no structure. Functions are often good program nuggets that can be reused in the same application or even ported/copied to other applications and reused indefinitely. Soon we will see a function and a class that read a FASTA file in Python and that can be used in any program that needs such a feature. Try the code and come back later for more.

Mar 22

Subroutines do not exist in Python. Everything is a function, all functions return a value (even if it’s None), and all functions start with def. This statement is from Dive into Python, a book on Python programming available for free. As mentioned, Python functions start with the word def, followed by the function name, followed by the arguments the function receives in between parentheses. Something like

def my_first_function(somevalue):
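The claim that every function returns a value, even if it is None, is easy to check. In this small sketch the function names greet and double are made up for illustration:

```python
def greet(name):
    # No return statement here: Python returns None implicitly
    print("Hello, " + name)

def double(value):
    # An explicit return hands the value back to the caller
    return value * 2

result = greet("Pythonista")
print(result is None)   # True
print(double(21))       # 42
```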

Python coders (sometimes called Pythonistas, among other names) usually follow the Python coding style, which states: “Function names should be lowercase, with words separated by underscores as necessary to improve readability.” We are going to use this style here whenever a function comes in handy. The parameters passed to the function (somevalue above) do not have a declared datatype; Python accepts whatever is passed. It is therefore the responsibility of your code to handle the value passed into the function and avoid errors. Functions also follow the same indentation rules as the rest of the code, and the line after the declaration should be indented with four spaces

def my_first_function(somevalue):
    do_something
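Because parameters carry no declared datatype, a function that only makes sense for strings has to check its input itself. A minimal sketch, assuming a hypothetical function sequence_length that is not part of the book's scripts:

```python
def sequence_length(seq):
    """Return the length of a sequence given as a string.

    sequence_length is a hypothetical example function.
    """
    # Python does not check the type for us, so the function does it itself
    if not isinstance(seq, str):
        raise TypeError("expected a string, got " + type(seq).__name__)
    return len(seq)

print(sequence_length('ACGT'))   # 4

try:
    sequence_length(1234)
except TypeError as error:
    print("Error: " + str(error))   # Error: expected a string, got int
```

Whether to raise an error, return a default value, or convert the input is a design choice; raising early is just one reasonable option.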

So, let’s warm up with functions. The following script is just the start: it adds a poly-T tail to a DNA sequence. We are going to use our old friend AY162388.seq. I will be back after the script

#! /usr/bin/env python

def add_tail(seq):
    result = seq + 'TTTTTTTTTTTTTTTTTTTTT'
    return result

dnafile = 'AY162388.seq'
file = open(dnafile, 'r')

sequence = ''
for line in file:
    sequence += line.strip()

print(sequence)
sequence = add_tail(sequence)
print(sequence)

Not very useful at first sight, but it gives us an impression of what a function looks like. Basically, we define a function add_tail that receives seq as a parameter. Don’t worry about variable scope now; we will see it later. The rest of the script is just like things we saw before, except for the line sequence = add_tail(sequence). Here we save a little memory (yep, not that much and not even impressive) by assigning the return value of the function back to the same name that held the sequence, instead of creating a second variable. Run the script and get ready for the command line arguments.
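What happens in sequence = add_tail(sequence) can be seen with a short experiment. Strings in Python are immutable, so the function builds a new string and the name sequence is simply pointed at it. In this sketch the tail parameter and its short default are assumptions added for illustration; the script above hard-codes a longer tail.

```python
def add_tail(seq, tail='TTT'):
    """Return seq with tail appended; the short default tail is an assumption."""
    return seq + tail

original = 'ACGA'
sequence = original             # both names refer to the same string
sequence = add_tail(sequence)   # sequence now names a new, longer string

print(original)   # ACGA
print(sequence)   # ACGATTT
```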

Mar 15

Beginning the third section of our tutorial/guide, we are going to look at chapter six of BPB (Beginning Perl for Bioinformatics). This chapter discusses creating subroutines (in Python’s case, functions) and debugging code.

We are going to start at the end. In Python, code debugging can be done as in any other programming language: Perl has its built-in debugger (perl -d), C/C++ has gdb, etc. Python has the pdb module, which can be imported and run to check for errors in your code. Using this command line:

python -m pdb myscript

will start the debugger module, which in turn runs your script. If you are an experienced programmer who is just starting with Python, pdb usage might look simple and straightforward. On the other hand, if you don’t have a lot of programming experience, I would suggest a different approach until you become more comfortable with the language. Python has a great advantage over some other interpreted languages: it lets you code interactively in the interpreter. So if your code is not working properly, for example giving a wrong output or calculating a value incorrectly, you have two options: try out the failing part of your script in the interpreter, or use the first rule of debugging and include print statements that output the values of variables/objects.
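As a rough illustration of the print approach, the sketch below adds a temporary debugging line to a small function. The function gc_fraction and its debug flag are hypothetical, made up for this example, and the function assumes a non-empty sequence.

```python
def gc_fraction(seq, debug=False):
    """Return the fraction of G and C bases in seq (hypothetical example).

    Assumes seq is a non-empty string.
    """
    gc_count = seq.count('G') + seq.count('C')
    if debug:
        # Temporary output to inspect intermediate values while debugging
        print("gc_count = " + str(gc_count) + ", length = " + str(len(seq)))
    # float() keeps the division exact under both Python 2 and Python 3
    return float(gc_count) / len(seq)

print(gc_fraction('GGCCAT', debug=True))
```

Once the values look right, the debug output can be switched off by leaving out the debug argument, or the print line can be deleted.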

Another option is to use a Python code editor, which will also help you by highlighting your code. I have little experience with Python code editors, as I normally code in Linux and use Kate. Lately I have been trying Komodo Edit, a free cross-platform editor from ActiveState. It looks pretty good, but I have never tried debugging my code with it.

So, this is my advice if you are just starting to program. Maybe because of the age of Beginning Perl for Bioinformatics (published in 2001), Perl’s debugger was the only option back then. Thanks to major advances in open-source and free software, there are many other options nowadays to debug your code.
