A script to simulate DNA sequence sets
Section 4 April 4th, 2007In Beginning Perl for Bioinformatics the chapter that covers simulated mutations on a DNA sequence is quite verbose and the code examples employ some subroutines to do what we have done on the last post. Basically the code example that generates a random DNA sequence is the last one on the chapter, but it was the first one we covered. Also this code example has a twist that our code from the last post does not have, which is it allows you to generate a set of sequences with different length instead of one sequence with fixed length that our script does.
So, the script that generates a user defined simulated DNA sequence set is
#!/usr/bin/env python
import random
import sys
def simulate_sequence(length):
dna = ['A', 'C', 'G', 'T']
sequence = ''
for i in range(length):
sequence += random.choice(dna)
return sequence
setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])
sequenceset = []
for i in range(setsize):
rlength = random.randint(minlength, maxlength)
sequenceset.append(simulate_sequence(rlength))
for sequence in sequenceset:
print sequence
Simple and efficient. First we define a function that generates a simulated DNA sequence from the four nucleotides, again using the random.choice. The sequence length is based on the parameter received by the function. The sequence length is another random number defined in the main body of the script. This random number is generated by random.randint with a range based on the arguments given by the user when running the script. And finally, the number of sequences to be simulated is define by the first parameter. Seventeen lines.
Next we will see how to draw some scientific information about the sequences, such as sequence identity and nucleotide frequency.
October 25th, 2007 at 11:15 am
Hi Paulo:
No matter how I change the value of ’setsize’, the above script always produces 1 set of sequence, that is supposed to produce the different sets of random sequences according to the changes of ’setsize’.
I have done a bit modification with above script after line 17, that works fine now. Please have a look and jive some suggestions.
————
rlength = []
for i in range(setsize):
rlength.append(random.randint(minlength, maxlength))
for length in rlength:
sequenceset.append(simulate_sequence(length))
for sequence in sequenceset:
print sequence
——————-
regards!
Dilmurat
October 25th, 2007 at 3:55 pm
Hi Dilmurat
In fact this script generates only one set of sequences. It was never intended to generate more than one set. The difference to the previous entry is that it can generate sequences with different lengths.
Sorry if the explanation wasn’t clearer. I will write a post and incorporate your ideas.
Cheers
Paulo
October 26th, 2007 at 11:43 am
Hi Paulo:
I think there is a slight problem to view above code with mozilla, that the reason confused me. Now I checked the code with Internet Explorer, that works fine.
The viewing problem happens to the following code:
—-
for i in range(setsize):
rlength = random.randint(minlength, maxlength)
sequenceset.append(simulate_sequence(rlength))
—-
With mozilla, it shows:
—
for i in range(setsize):
rlength = random.randint(minlength, maxlength)
sequenceset.append(simulate_sequence(rlength))
—
Thanks!
Dilmurat
October 26th, 2007 at 12:15 pm
[...] commenter, Dilmurat, gave me an idea about the script that generates random DNA sequence sets. Apparently it wasn’t clear that the script was [...]