Splitting a multiple sequence FASTA file
Phase 2 October 9th, 2007starting the new phase of the website, we are going to see some random thought in Python and Bioinformatics. Any ideas would be highly appreciated, and I will use some terms used in searches that ended up finding this blog or this or this. We will also see some of the ideas that are being developed in Infasta.
We will see the easiest way to split a FASTA file. There are many ways to name each file after the split, and we will use a simple original filename + number. We also going to use the FASTA class and functions defined previously in fasta.py. To refresh our memories here is the code
class Fasta:
def __init__(self, name, sequence):
self.name = name
self.sequence = sequence
def read_fasta(file):
items = []
index = 0
for line in file:
if line.startswith(">"):
if index >= 1:
items.append(aninstance)
index+=1
name = line[:-1]
seq = ''
aninstance = Fasta(name, seq)
else:
seq += line[:-1]
aninstance = Fasta(name, seq)
items.append(aninstance)
return items
So, basically we need to create a script that reads the file and then in a loop saves every sequence in a separate file. For the sake of simplicity all saved files will be in the same directory. To the script
#!/usr/bin/env python
import sys
import fasta
#define a variable with the input filename
file = sys.argv[1]
#open the file and read the sequences in one line
sequences = fasta.read_fasta(open(file, 'r').readlines())
#counter to keep the number the sequences and to
#be written to the output files
count = 1
for i in sequences:
#open a file to output
output = open(file+str(count), 'w')
#write the name and sequence
output.write(i.name+'\n')
output.write(i.sequence)
#increment the counter
count += 1
Very simple but nice script. In the end we will have a list of files with only one sequence in it, with identical name of the original file but with a number in the end. Next we will see how to modify the filename and even extract some information from the FASTA sequence to use as a filename.
October 10th, 2007 at 3:30 am
This is one of those ‘bioinformatics’ tasks where people are seriously guilty of reinventing the wheel. If for the moment we assume most bioinformaticians are working on a *nix workstation ‘csplit’ should probably be your first port of call for context based splitting of files! And it’s far more widely applicable than for just this example.
Remember – if it seems like a logical operation and you are in the ‘why hasn’t anyone made a utility to do this?’ frame of mind – look around because they probably have!
October 10th, 2007 at 7:49 am
Hi Daniel
Maybe you have misunderstood the real focus of this blog/tutorial. I am not trying to reinvent the wheel, but create real-life examples with Python and bioinformatics. The main objective here is to expose people to Python.
If the intention of this site was to show Unix/Linux commands I would be writing about sed, awk, cat, grep, etc that would in the end replace everything that I posted here and more.
Remember that not everyone is like you with access to a Unix/Linux workstation and knowledge of all Unix/Linux commands. Some people (at least 69%) access this site from a Windows machine and have no access to a bash shell. You could easily mirror all posts here and transform them into a Unix/Linux string of commands, just to show people how to do it from the command line in your blog.
And if reinventing the wheel was a problem, we shouldn’t have so many redundant programming languages, should we?
October 18th, 2007 at 5:23 am
[...] the site has drifted away from the book, and I was compelled to comment on this article here. This caused, I think, some annoyance to the author who felt the need not only to rebut me in the [...]