99 problems
off topic Comments OffAndré Roberge, of Crunchy fame, is converting a list of 99 Prolog problems to Python. Check it out.
Contact form
off topic Comments OffI added a contact form if anyone is interested in dropping me a note. It can be accessed from the menu above.
Splitting a FASTA file using awk (no sed required), or do we care about csplit?
Phase 2, off topic Comments OffWe saw that “top notch bioinformaticians” use csplit to split FASTA files, so I decided to post as many as possible alternatives to split these files. As csplit, awk is something found with more frequency in Linux machines than Windows, but it can be installed on Windows (even Vista) and it runs fine. Awk is usually shown in parallel with sed, as both utilities have the main objective of text processing.
I browsed the web searching for a sed/awk solution to split FASTA files something that I thought could be done in one line. And I found it. With a little tweak I ended up with a one-liner using awk that split the sequences in multiple files:
awk '/^>/{f="fasta."++d} {print > f}'
Awk breaks each line of its input into fields that can be checked, printed or ignored. In the command line above. awk’s commands are in between the single quotes. The first command is a regex that matches initial >. Inside the curly braces we assign a variable f to store the output filename we want and also have an integer that is incremented every time we pass through it. The second curly braces are for the actual output, where we use awk’s print command and redirect output to a file named after the variable we created in the first commands in between curly braces. Finally, we assign an input file, outside the region delimited by single quotes. And that’s it. As output we will have a series of files called fasta.# each one containing one sequence.
Generating multiple sets of random DNA sequences with one script (and a bash example) [updated]
Section 4, off topic 3 Comments »A commenter, Dilmurat, gave me an idea about the script that generates random DNA sequence sets. Apparently it wasn’t clear that the script was intended to generate only one sequence set, and not multiple sets. Dilmurat also offered his solution:
#!/usr/bin/env python
import random
import sys
def simulate_sequence(length):
dna = ['A', 'C', 'G', 'T']
sequence = ''
for i in range(length):
sequence += random.choice(dna)
return sequence
setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])
rlength = []
sequenceset = []
for i in range(setsize):
rlength.append(random.randint(minlength, maxlength))
for length in rlength:
sequenceset.append(simulate_sequence(length))
for sequence in sequenceset:
print sequence
I don’t know if the lack of formatting in the comments might have generated a different result to me, but let’s see how we can generate multiple sets of random DNA sequence. Getting back to section 4, we can start with the original script that creates only one set
#!/usr/bin/env python
import random
import sys
def simulate_sequence(length):
dna = ['A', 'C', 'G', 'T']
sequence = ''
for i in range(length):
sequence += random.choice(dna)
return sequence
setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])
sequenceset = []
for i in range(setsize):
rlength = random.randint(minlength, maxlength)
sequenceset.append(simulate_sequence(rlength))
for sequence in sequenceset:
print sequence
There is a very easy way of changing this code to make it output a definite number of sets. WE add a new input parameter and also a loop to it, and that should take care of everything.
#!/usr/bin/env python
import random
import sys
def simulate_sequence(length):
dna = ['A', 'C', 'G', 'T']
sequence = ''
for i in range(length):
sequence += random.choice(dna)
return sequence
setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])
nsets = int(sys.argv[4])
for i in range(nsets):
sequenceset = []
for i in range(setsize):
rlength = random.randint(minlength, maxlength)
sequenceset.append(simulate_sequence(rlength))
for sequence in sequenceset:
print sequence
print
A test with 10 as the number of sequences, 20 as minimum and 30 as maximum length and 2 sets output somthing like this
GCAGATAAACGGGTAAGATGA
ATTAGGAAGGGTCTAAGAGATCCCGGTAA
ATCAGCCTCCGCACGCCTCGTGTA
AGTTTAGCTAGCCGTAGTACTTTC
TTTGAAGCTCGGGGCTAAACTCCGACCAC
GACAGGGTAAGGTGACCTTCTCAG
TGGTTGACTCTGTTCTCTGGT
CGCAATACGTCGTGGTGCCGCTCCACT
TCCTGCAGGGAGGCAGCAGC
TTCGGAGGTATTGAGTACTGTTGCATCG
CCCTTATAATACCTAGTGTAACATTC
TAGTCATGTCATATAACCTGTGTCG
AGCGCGAACCCTCCAGTTTTTTT
AGGGCGTGTAAGCGACGTTGAACAAG
CCATGCGTCTTTTTGGCGATCCGT
ACTCAGATCGGCTAGTATCCTTCGCACG
CTATCGTCGTTCCCAGCAACGGCAGCAGA
CGTGCGATGGGGCGAGATTTTT
ACGCTTTGGCTAGTGGGGAGTTTCGCCACT
what is basically what we need. Quick ‘n’ easy.
Of course a top notch bioinformatician that uses Linux 100% of the time would be able to create a simple bash script and use the initial script to generate multiple random sets. bash is not a primary scope of the blog, but let’s see how to do that. Saying the our initial script was saved as randomset.py we would have a very short script in bash to generate multiple sets. We want to generate 10 sets of sequences of length between 20 and 30 nucleotides, and each set would need to have 10 different sequences (remeber that our initial script only accepts 3 parameters).
Off topic
for ((i=0;i<10;i++))
do
min=20
max=30
nseqs=10
python randomset.py $nseqs $min $max
echo
done
That should do the trick. You can even type it from the command line. The for syntax in bash is very similar to C/C++ but it does not accept spaces, as all the variable assignments. When you assigning a variable value in bash you omit the dollar sign and when you are reading the variable value you access it with the dollar sign, like in the line that runs the script. A for loop in bash starts with a do command and ends in a done command, not brackets or parenthesis involved. The echo at the end helps adding a separating blank line after each random set.
off topic
Of course we could use "makenucseq from the EMBOSS package", we can even use Dawg from my friend Reed Cartwright, which I believe has a shorter learning curve (and faster compilation) than EMBOSS, but that will be away from the scope of this blog, which is to show how to use Python in simple bioinformatics scripts. I allowed myself to post off topic subjects (including this last paragraph) because a (very) small percentage of the commenter(s) seems to misunderstand the real focus of this tutorial. We don't need fancy words or complex topics to show how simple things can achieve good results in almost no time. My focus is not give the reader the fish and chips ready to eat. My focus is to show how to fish and peel, cut and fry the potato in order to cook the dish. This way, next time he or she might be cooking better things to eat.
Off topic
off topic 1 Comment »Mr Daniel Swan accused me of not publishing his comment or reply to his comments. I have never seen his reply, either in the comments or as an email to me, so to fulfill his desires it is published here.
I admit that some of the things you have addressed in the past, whilst ‘reinventing’ the wheel, offer important insights into the general problems of programming in general – this particular example was not one of them. From following the course which I assume was outlined by ‘Beginning Perl for Bioinformatics’ (which I do not think is a good reference for beginning Perl at all) a user should already be able to perform the following operations – open a file, iterate over and parse the contents, open a file for writing, and apply some rules for formatting the output. Thus you have already given them the tools to complete the operation should they wish to do so in a Python context.
The problem is that you are trying to educate potentially early career/transitional bioinformaticians here – and showing them how to complete tasks in X lines of Python when there is a single command line application to do the trick I think is disingenuous. We wouldn’t dream, for instance, of educating our MRes students in this way.
I also would take issue that diversity is a key. You can be as diverse as you like but no-one is seriously going to suggest that you write your next script to parse something out of a GenBank file in assembler. One of the hallmarks of any computer professional in any field is the ability to use the right tool for the job. You should be technology agnostic and be prepared to string together applications or scripts from anywhere to reach your goal – take a look at something like Taverna for instance.
I notice in the other post that you mention there is a high proportion of Windows users to this site and therefore command line *nix comments are of no interest. Although I know a number of bioinformaticians who use Windows (generally to integrate with an institutions existing Microsoft heavy infrastructure) they all have logins to dozens of *nix machines, run *nix in a VM or at the very least have a Cygwin installation with which to work with. Basing the availability of technology on the OS of your page imprints is not necessarily indicative.
I hope you don’t mind me addressing these points here rather in my blog, as my blog is personal (hence the lack of bioinformatics content on there). I have actually been following these tutorials for a while (as someone who came into Bioinformatics some time ago – my first forays in programming were in Perl) it is nice for me to see tasks that I find intuitive in one language written in another. Consequently the fact that I felt need to comment, merely suggests that I felt strongly on this issue, rather than having any negative feelings towards what you are trying to achieve.
I am sorry for the usual readers of this tutorial, for posting his comments as an article here. We will be back to our uneducated, simple and basic “tutorial for bioinformatics in X lines for the noobie in Windows” soon.
Thanks Daniel.
Recent Comments