Alternative methods to split a FASTA file - updated (again)
Phase 2 October 10th, 2007As Daniel didn’t enlightened us on how to use csplit, I am posting several ways on how to split a multiple sequence FASTA file. This post gets out of our focus (if you haven’t noticed, our focus here is Python, and maybe suffers from the NIH effect. Not invented here. We will be back with our normal programming after this.
We start with our “widely” available option csplit:
csplit -kf seq file '%^>%' '/^>/' '{*}'
where k tells the program to keep files in case of an error, f sets a prefix for the output files (otherwise xx00 is used) and two regex patterns are used: between %s is the one to skip and between /s is the one to copy up to but not including the line. In {} is the number of times we want to have the previous patterns repeated, a * meaning as many time as possible.
Easy, eh? Now, we can check how perl fares. I am not a perl monger so I googled it and I found an one-liner. Quoting the original site:
Split a multi-sequence FastA file into individual files named after their description lines.
perl -ne 'BEGIN{$/=">"}if(/^\s*(\S+)/){open(F,">$1")||warn"$1 write failed:$!\n";chomp;print F ">", $_}'
and I am not daring to explain this one. But cariaso did and also sent a better version of it
# This magic variable makes perl read lines
# that end with >
# instead of a newline n
$/=”>”;
while (<>) { # foreach line in the input files
if (/^>(w+)/) { # grab the first word of text
open(F,”>$1″) || # open a file named that word
warn “$1 write failed:$!n”;
chomp; # strip off the > at the end
print F “>”, $_; # print >text to the file
}
}
You can also do in Python, without any OO programming in 5 lines:
file = open('myfile').readlines()
str = ''.join(file)
temp = str.split('>')
for i in temp:
print '>' + i
that can be run with no difficulties from the interactive console. And which cariaso also corrected and improved
fileobj = open("myfile.fasta")
ignore = fileobj.read(1)
text = fileobj.read()
records = text.split(">")
for i in records:
print '>' + i
along the offer of a nicer approach
def eachfasta(fileobj):
sofar = fileobj.readline()
for line in fileobj:
if '>' == line (bracket zero close bracket):
yield sofar
sofar = line
else:
sofar += line
yield sofar
fileobj = open("myfile.fasta")
for i in eachfasta(fileobj):
print i
Also Joe wanted to dip in and sent his approach:
# f6smod.py 10-26-07 jto
# purpose: modify program by Michael Cariaso, given in Paolo Nuin's blog:
# http://python.genedrift.org
import sys
fileobj = open('fasta.seq')
ignore = fileobj.read(1)
# The above line removes the first char from the file object.
# It can also be used to check that the first char is a '>'
if ignore != '>':
print "The first character in the supposed FASTA file is: " + ignore
print "not '>', so sys.exit() is being invoked."
sys.exit()
text = fileobj.read()
records = text.split('>')
# Here, rather than use the for loop to just print out the sequences, it
# is used to store them in a list. After that they can be printed out, or
# stored in separate files, or or further split into header line and sequence
# (using the carriage return at the end of the header file).
seqlist = []
listcount = 0
# store each header/sequence in a list
for i in records:
i = '>' + i
seqlist.append(i)
listcount += 1
# print the list
for seq in seqlist:
print seq
# Split into header line and sequence, and make the sequence a single string.
for seq in seqlist:
splitCR = seq.split('\n')
print "header: " + splitCR[0]
sequence = ''.join(splitCR[1:])
print "sequence: "
print sequence
OK, so three five six more options available. I am pretty sure sed and awk would be able to do it glued by a bash script, but I am far of being an expert in sed or awk. If you know how to do it, leave a comment.
October 10th, 2007 at 7:58 pm
[...] Alternative methods to split a FASTA file As Daniel didn’t enlightened us on how to use csplit, I am posting several ways on how to split a multiple sequence FASTA file. [...]
October 11th, 2007 at 11:41 pm
I think this is a marvellous resource but (and perhaps I missed this), why are you using python instead of biopython. Is that to come later?
I would like to see a tutorial that takes biopython through in very simple steps. I find the biopython ‘Tutorial and Cookbook’ assumes a strong knowledge of python. So you spend then spend an inordinate amount of time learning how to do things the hard way, only to find later that biopython modules have been built specifically to try and make it easy to handle DNA etc.
Cheers, Mike
October 12th, 2007 at 7:11 am
Hi Mike
I have the intention of include some BioPython basic usage in the future. I am not an experienced user of BioPython, but I am currently playing with it and hope to but something soon here.
The idea of the site was to create a basic guide of Python and Bioinformatics, especially for people that don’t know that much Python and/or a lot of biology. Most of the scripts presented here are very simple and short; the scripts can be used as a primer for more advanced programming. And I hope to do that with BioPython too.
I would like to include some basic tutorials of other Python packages too (bioinfo or not) and even some basic steps to build interfaces with wxPython.
If there is an specific topic you would like to see it covered, please let me know. And thanks for your support.
Cheers
Paulo
October 13th, 2007 at 3:48 pm
I think your process is excellent, please continue. I hope to see the same core python theme continue. Others seems to want a bioinfomatics course with python as the language.
the old perl motto was
Tim Toady. In that spirit, I was unaware of cplit and am still unlikely to use it. I probably would have kicked this one off in perl, and assuming someone will find it useful, I will try to explain the perl above. I expect there is also a sed/awk solution, and I’d be curious to hear those if anyone can enlighten me.
Incidentally I use biopython daily, but mainly for the ncbi eutils interfaces. I certainly don’t know the biopythonic way to split fasta. I expect Andrew Dalke might.
# This magic variable makes perl read ‘lines’
# that end with ‘>’
# instead of a newline ‘\n’
$/=”>”
while () { # foreach ‘line’ in the input files
if (/^\s*(\S+)/) { # grab the first ‘word’ of text
open(F,”>$1″) || # open a file named that ‘word’
warn “$1 write failed:$!\n”;
chomp; # strip off the “>” at the end
print F “>”, $_; # print ‘>text’ to the file
}
}
October 16th, 2007 at 3:50 pm
Hi Cariaso
Thanks for the Perl explanation. I am researching a way to do it with sed and awk and I post it if no one posts before.
I also expect that Andrew would know the Pythonic way! I will probably write him and ask.
Cheers
Paulo
October 17th, 2007 at 8:08 pm
[...] have posted in my other site different ways to split a FASTA file. Take a look. The Python method is [...]
October 19th, 2007 at 5:03 pm
The perl above works better as
# This magic variable makes perl read lines
# that end with >
# instead of a newline \n
$/=”>”;
while () { # foreach line in the input files
if (/^>(\w+)/) { # grab the first word of text
open(F,”>$1″) || # open a file named that word
warn “$1 write failed:$!\n”;
chomp; # strip off the > at the end
print F “>”, $_; # print >text to the file
}
}
It as well as the python scripts above suffer from an extra ‘>’ at the beginning of the output.
An easy fix is
fileobj = open(”myfile.fasta”)
ignore = fileobj.read(1)
text = fileobj.read()
records = text.split(”>”)
for i in records:
print ‘>’ + i
But I consider this a better approach
def eachfasta(fileobj):
sofar = fileobj.readline()
for line in fileobj:
if ‘>’ == line[0]:
yield sofar
sofar = line
else:
sofar += line
yield sofar
fileobj = open(”myfile.fasta”)
for i in eachfasta(fileobj):
print ‘**[%s]**’ % i
October 19th, 2007 at 5:13 pm
for a legible copy of the above visit
http://mike.pbwiki.com/pythonfasta
October 19th, 2007 at 6:38 pm
Hi Cariaso
I will add your code to the post (referencing it of course). I will work on a sed/awk solution this weekend.
Thanks a lot for your input.
October 26th, 2007 at 11:32 am
Dear Pablo,
It’s been fascinating reading your blog. And very instructive. I’m a beginner in Python and bioinformatics, but have been able to follow it (really, that’s not a boast but a compliment).
The way to split a FASTA file with Python that Michael Cariaso sent in is short, efficient and understandable (maybe that’s what Pythonic means). I’m sending a little modification (actually two small additions) to it. I’m pretty sure that they’d be obvious additions to you and to him, but, if correct, they might be useful and instructive to others who are “beginning Python for bioinformatics” like me. It seems to work with the fasta.seq file you supplied in you blog, and some permutations and multiplications of it that I invented.
At any rate, if you have the time to examine and correct it, I’d be grateful.
Joe
# f6smod.py 10-26-07 jto
# purpose: modify program by Michael Cariaso, given in Paolo Nuin’s blog:
# http://python.genedrift.org
# archive: kept by joeoettinger@netscape.net
import sys
fileobj = open(’fasta.seq’)
ignore = fileobj.read(1)
# The above line removes the first char from the file object.
# It could also be used to check that the first char is a ‘>’
if ignore != ‘>’:
print “The first character in the supposed FASTA file is: ” + ignore
print “not ‘>’, so sys.exit() is being invoked.”
sys.exit()
text = fileobj.read()
records = text.split(’>’)
# Here, rather than use the for loop to just print out the sequences, it
# is used to store them in a list. After that they can be printed out, or
# stored in separate files, or be further split into header line and sequence
# (using the carriage return at the end of the header file).
seqlist = []
listcount = 0
# store each header-sequence in a list
for i in records:
i = ‘>’ + i
seqlist.append(i)
listcount += 1
# Just to show it’s working right, print the list
for seq in seqlist:
print seq
# Split into header line and sequence, and make the sequence a single string.
for seq in seqlist:
splitCR = seq.split(’\n’)
print “header: ” + splitCR[0]
sequence = ”.join(splitCR[1:])
print “sequence: ”
print sequence
October 26th, 2007 at 12:20 pm
Hi Joe
Would you mind sending me source in an email? The comments are not formatted for Python.
Thanks
Paulo
October 26th, 2007 at 3:20 pm
Dear Paolo,
I’m embarrassed to say that I can’t find the email address on your blog to send the source to. If you don’t mind sending same to me or telling me where to find it, I’ll be glad to send the source. Thanks for your patience.
Joe
October 26th, 2007 at 3:30 pm
Hi Joe
I sent you an email to your netscape address (the one I have registered). Check your spam box otherwise my email is
nuin at genedrift dot org
Cheers
Paulo
October 26th, 2007 at 5:33 pm
Dr. Oettinger MD
With the whitespace issues its hard for me to be sure, but it appears that you solution does add some useful checks. However I’d like to caution you on one aspect. It appears you are trying to keep all of the sequences in memory, before you finally print them. That is fine for small files, but fasta files are often quite huge (especially the ones you’d want to split). If so, your code will hit a pretty hard wall once the input file is nearly as big as ram.
October 26th, 2007 at 6:33 pm
Dear Paolo,
Now I see what you mean! Somehow my indents don’t show on the comments posts! They should show on the email I sent, but I’m using a lot of your time, and this must be rather boring to any readers of the comments. So I’m going to post less and read more for a while.
Thanks for the advice on not keeping all the data in memory.
Joe
October 27th, 2007 at 9:54 am
Hi Joe
Don’t worry about commenting and please continue doing so. And don’t worry about using my time or annoying the readers. It is very important to receive comments, suggestions, ideas and criticism.
Please continue posting. Or send me an email when needed, it is a pleasure to interact.
I am posting your code to the topic this afternoon.
Paulo
October 29th, 2007 at 11:37 am
[...] saw that “top notch bioinformaticians” use csplit to split FASTA files, so I decided to post as many as possible alternatives to split these files. As csplit, awk is something found with more frequency in Linux machines than [...]
February 4th, 2008 at 10:29 am
I have one question, how can I process the following format in perl:
>fasta1
AAGCCCCCCCCCCCCCCCCCCCCCCCCCCCCGAGAT
1 80
16.30 (((((()))))>>>…….())))))))) 0.999
GGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGG
2 81
16.30 (((((()))))>>>…….())))))))) 0.999
.
.
>fasta2
1 80
AACCCGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC
14.40 (((((()))))>>>…….())))))))) 0.89
AACCCGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC
140 (((((()))))>>>…….())))))))) 0.89
.
.
I want to read the fasta header and perse the value from each line 14.40 (((((()))))>>>…….())))))))) 0.89″
using perl REGEX. I know I need two while loop(one to read a line and one to check fasta header) but unable to get the result I need. Each fasta file has multiple line under them and the number varies but the fomat is the one described above.
Here is the result I need:
Fasta1
$value1= AAGCCCCCCCCCCCCCCCCCCCCCCCCCCCCGAGAT
$value2 = 1
$value3 = 80
$value4 = 16.30
$value 5 =(((((()))))>>>…….()))))))))
$value6 = 0.999
Fasta1
…..
fast1
—-
I know how to handle regular expression and passing the above value to database. I only need the loop to itrate
Thank you,
Dereje
February 4th, 2008 at 1:40 pm
My perl is a little rusty (the site is about Python!!) but my best bet would be to read line by line checking for the first character. If it is a ‘>’ you process the lines until you find a new ‘>’.
A ‘while’ loop inside a if clause with a couple of flags should to the trick.
HTH
Paulo