Generating multiple sets of random DNA sequences with one script (and a bash example) [updated]

Section 4, off topic 3 Comments »

A commenter, Dilmurat, gave me an idea about the script that generates random DNA sequence sets. Apparently it wasn’t clear that the script was intended to generate only one sequence set, and not multiple sets. Dilmurat also offered his solution:

#!/usr/bin/env python

import random
import sys

def simulate_sequence(length):
    dna = ['A', 'C', 'G', 'T']
    sequence = ''
    for i in range(length):
        sequence += random.choice(dna)
    return sequence

setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])

rlength = []
sequenceset = []
for i in range(setsize):
    rlength.append(random.randint(minlength, maxlength))

for length in rlength:
    sequenceset.append(simulate_sequence(length))

for sequence in sequenceset:
    print sequence

I don’t know if the lack of formatting in the comments might have generated a different result to me, but let’s see how we can generate multiple sets of random DNA sequence. Getting back to section 4, we can start with the original script that creates only one set

#!/usr/bin/env python

import random
import sys

def simulate_sequence(length):
    dna = ['A', 'C', 'G', 'T']
    sequence = ''
    for i in range(length):
        sequence += random.choice(dna)
    return sequence

setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])

sequenceset = []
for i in range(setsize):
    rlength = random.randint(minlength, maxlength)
    sequenceset.append(simulate_sequence(rlength))

for sequence in sequenceset:
    print sequence

There is a very easy way of changing this code to make it output a definite number of sets. WE add a new input parameter and also a loop to it, and that should take care of everything.

#!/usr/bin/env python

import random
import sys

def simulate_sequence(length):
    dna = ['A', 'C', 'G', 'T']
    sequence = ''
    for i in range(length):
        sequence += random.choice(dna)
    return sequence

setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])
nsets = int(sys.argv[4])

for i in range(nsets):
    sequenceset = []
    for i in range(setsize):
        rlength = random.randint(minlength, maxlength)
        sequenceset.append(simulate_sequence(rlength))

    for sequence in sequenceset:
        print sequence

    print

A test with 10 as the number of sequences, 20 as minimum and 30 as maximum length and 2 sets output somthing like this

GAGGACGACGTGTCGGATTAGAT
GCAGATAAACGGGTAAGATGA
ATTAGGAAGGGTCTAAGAGATCCCGGTAA
ATCAGCCTCCGCACGCCTCGTGTA
AGTTTAGCTAGCCGTAGTACTTTC
TTTGAAGCTCGGGGCTAAACTCCGACCAC
GACAGGGTAAGGTGACCTTCTCAG
TGGTTGACTCTGTTCTCTGGT
CGCAATACGTCGTGGTGCCGCTCCACT
TCCTGCAGGGAGGCAGCAGC

TTCGGAGGTATTGAGTACTGTTGCATCG
CCCTTATAATACCTAGTGTAACATTC
TAGTCATGTCATATAACCTGTGTCG
AGCGCGAACCCTCCAGTTTTTTT
AGGGCGTGTAAGCGACGTTGAACAAG
CCATGCGTCTTTTTGGCGATCCGT
ACTCAGATCGGCTAGTATCCTTCGCACG
CTATCGTCGTTCCCAGCAACGGCAGCAGA
CGTGCGATGGGGCGAGATTTTT
ACGCTTTGGCTAGTGGGGAGTTTCGCCACT

what is basically what we need. Quick ‘n’ easy.

Of course a top notch bioinformatician that uses Linux 100% of the time would be able to create a simple bash script and use the initial script to generate multiple random sets. bash is not a primary scope of the blog, but let’s see how to do that. Saying the our initial script was saved as randomset.py we would have a very short script in bash to generate multiple sets. We want to generate 10 sets of sequences of length between 20 and 30 nucleotides, and each set would need to have 10 different sequences (remeber that our initial script only accepts 3 parameters).

Off topic

for ((i=0;i<10;i++))
do
min=20
max=30
nseqs=10
python randomset.py $nseqs $min $max
echo
done

That should do the trick. You can even type it from the command line. The for syntax in bash is very similar to C/C++ but it does not accept spaces, as all the variable assignments. When you assigning a variable value in bash you omit the dollar sign and when you are reading the variable value you access it with the dollar sign, like in the line that runs the script. A for loop in bash starts with a do command and ends in a done command, not brackets or parenthesis involved. The echo at the end helps adding a separating blank line after each random set.

off topic

Of course we could use “makenucseq from the EMBOSS package”, we can even use Dawg from my friend Reed Cartwright, which I believe has a shorter learning curve (and faster compilation) than EMBOSS, but that will be away from the scope of this blog, which is to show how to use Python in simple bioinformatics scripts. I allowed myself to post off topic subjects (including this last paragraph) because a (very) small percentage of the commenter(s) seems to misunderstand the real focus of this tutorial. We don’t need fancy words or complex topics to show how simple things can achieve good results in almost no time. My focus is not give the reader the fish and chips ready to eat. My focus is to show how to fish and peel, cut and fry the potato in order to cook the dish. This way, next time he or she might be cooking better things to eat.

Still simulation

Section 4 Comments Off

We use our last code as a starting point in order to generate some real information from our simulated sequence sets. Our approach here will be the same: functions to do all the work for us and a very simple main code. We also reuse some code with applied before to count the nucleotides.

Let’s see the code, discussion just after it.

#!/usr/bin/env python

import random
import sys

def simulate_sequence(length):
    dna = ['A', 'C', 'G', 'T']
    sequence = ''
    for i in range(length):
        sequence += random.choice(dna)
    return sequence

def nucleotide_percentage(sequence):
    print str(sequence.count('A')) + ' As ',
    print str(sequence.count('C')) + ' Cs ',
    print str(sequence.count('G')) + ' Gs ',
    print str(sequence.count('T')) + ' Ts'

def sequence_identity(set):
    iden = []
    count = 0.0
    for x in range(len(set)-1):
	print str(x), str(x+1)
        for n in range(len(set[x])):
            if set[x][n] == set[x+1][n]:
                count += 1
        iden.append(count/len(set[x]))
        count = 0.0
    return iden

setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])

sequenceset = []
for i in range(setsize):
    rlength = random.randint(minlength, maxlength)
    sequenceset.append(simulate_sequence(rlength))

identity = sequence_identity(sequenceset)

for i in range(len(sequenceset)):
    print sequenceset[i]
    if i < len(sequenceset)-1:
        print 'sequence identity to next sequence : ' + str(identity[i])
    nucleotide_percentage(sequenceset[i])
    print

Well, not many new things here. The main one might the be the assignment of the variable count, which receives initially a value of 0.0, which in this case is a float. This variable is used the calculate the sequence identity which is a percentage, or a relative value. If we had initialized count with zero, our division count/len(set(x) would always be zero, due to the fact that count would have been an integer. The ideal input for this script would have equal minimum and maximum sequence lengths, as it the simulated set would look more like an alignment with no indels. Something like:

python simulation.py 10 30 30

As is pointed out in BPB, this example is more an indication that we are able to use our Python skills to actually make some real code, with some real output. One issue with this example is the fact that we only calculate sequence identity of two sequences at a time. It would be ideal to have sequence identity between all simulated sequences. This would require much more code, of course a good educational step, but it is something that can be easily obtained with classes and we will see this later on.

With this entry, we finished our Section 4 and we will start Section 5 with Python’s dictionaries, moving to FASTA files and classes next.

A script to simulate DNA sequence sets

Section 4 4 Comments »

In Beginning Perl for Bioinformatics the chapter that covers simulated mutations on a DNA sequence is quite verbose and the code examples employ some subroutines to do what we have done on the last post. Basically the code example that generates a random DNA sequence is the last one on the chapter, but it was the first one we covered. Also this code example has a twist that our code from the last post does not have, which is it allows you to generate a set of sequences with different length instead of one sequence with fixed length that our script does.

So, the script that generates a user defined simulated DNA sequence set is

#!/usr/bin/env python

import random
import sys

def simulate_sequence(length):
    dna = ['A', 'C', 'G', 'T']
    sequence = ''
    for i in range(length):
        sequence += random.choice(dna)
    return sequence

setsize = int(sys.argv[1])
minlength = int(sys.argv[2])
maxlength = int(sys.argv[3])

sequenceset = []
for i in range(setsize):
    rlength = random.randint(minlength, maxlength)
    sequenceset.append(simulate_sequence(rlength))

for sequence in sequenceset:
    print sequence

Simple and efficient. First we define a function that generates a simulated DNA sequence from the four nucleotides, again using the random.choice. The sequence length is based on the parameter received by the function. The sequence length is another random number defined in the main body of the script. This random number is generated by random.randint with a range based on the arguments given by the user when running the script. And finally, the number of sequences to be simulated is define by the first parameter. Seventeen lines.

Next we will see how to draw some scientific information about the sequences, such as sequence identity and nucleotide frequency.

Random sequences in Python

Section 4 4 Comments »

There is a reason to say that Python has batteries included. On the last post we have seen a small example of randomization in Python, generating random integer values for a (extremely) simple dice game. We move to another example, still simple which will allow us to generate random DNA sequences. The script is

#!/usr/bin/env python

import random
import sys

length = int(sys.argv[1])

dnaseq = list('ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT')

result = ''
for i in range(length):
    result += random.choice(dnaseq)

print result

Not fancy at all, just plain simple (yet again). We import a couple of modules, sys and random, and ask for the sequence length as a script parameter. dnaseq is a list containing a tandem repeat of ACGT, and from there we will extract our random nucleotides. In fact dnaseq could have been ‘ACGT’ only. The fact that we create a string and convert it to a list, is just for convenience of writing 'ACGT...' easier than ['A', 'C' ...]. We then declare an empty string that will be used to store the random sequence. And inside the loop, the command that does all the magic: random.choice(<list>). This command will return a random element from the list passed as subject. Using it inside a loop we will get a random nucleotide on each iteration and add it to our string.

This is a very simple command, but at the same time extremely powerful and easy to implement. A good exercise from this would be to modify the dnaseq string and see if there is any change in the final random sequence.

Ramdomization

Section 4 Comments Off

We start section 4 with a very short introduction to randomization. Chapter 7 of BPB discuss the use of randomization to obtain mutations in DNA and protein sequences. The example given in the book is at the same time simple and interesting, as it creates a paragraph from random selections of nouns, adjectives, verbs and other grammar elements. Here we are going to to create a very (stress on very) simple dice game, where each time you run the script it will throw two dices for you and two dices for the computer. It then prints the sum of the dices and tells who won the match.

Randomization is an important feature of computer languages. There are different researchers involved in the creation of the best approaches to generate random number in computers. Random number are important in the simulation of different natural processes, such as genetic mutation, gene drift, epidemiology, weather forecast, etc. The better the generator, the better the simulation.

Python has a random module included, with a myriad of functions that can perform different randomizations. In this post we will see the integer randomization, and in later entries we will see some other powerful functions.

Our script is quite simple, and the only new aspect for us here is the random module and the randint function.

#!/usr/bin/env python

import random

dice1 = random.randint(1,6)
dice2 = random.randint(1,6)

computerdice1 = random.randint(1,6)
computerdice2 = random.randint(1,6)

mine = dice1 + dice2
his = computerdice1 + computerdice2

print 'mine = ' + str(mine) + ' vs. computer = ' + str(his)

if mine > his:
    print "I won"
elif mine < his:
    print "Computer won"
else:
    print "Tie. Try again"

random.randint is a function that generates an integer random number between a range specified by the number between parentheses. In the above case, we are using a dice of 6 values. So, for every call of random.randint(1,6) a random number between 1 and 6 will be generated. We store this number in a variable, sum the user’s dices and the computer’s and check with a if clause to see the winner. A good exercise would be to make this script interactive, allowing multiple matches.

Next we will see another random module function and then we will generate mutations on DNA sequences.

Design by j david macor.com.Original WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in