Beginning Perl for Bioinformatics


Combining the Subroutines to Simulate Mutation



Download 1.4 Mb.
Page14/28
Date conversion29.03.2017
Size1.4 Mb.
1   ...   10   11   12   13   14   15   16   17   ...   28

7.3.3 Combining the Subroutines to Simulate Mutation

Now you've got all your ducks in place, so you write your main program as in Example 7-2 and see if your new subroutine works.


Example 7-2. Mutate DNA

#!/usr/bin/perl

# Mutate DNA

# using a random number generator to randomly select bases to mutate


use strict;

use warnings;


# Declare the variables
# The DNA is chosen to make it easy to see mutations:

my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';


# $i is a common name for a counter variable, short for "integer"

my $i;
my $mutant;


# Seed the random number generator.

# time|$$ combines the current time with the current process id

srand(time|$$);
# Let's test it, shall we?

$mutant = mutate($DNA);


print "\nMutate DNA\n\n";
print "\nHere is the original DNA:\n\n";

print "$DNA\n";


print "\nHere is the mutant DNA:\n\n";

print "$mutant\n";


# Let's put it in a loop and watch that bad boy accumulate mutations:

print "\nHere are 10 more successive mutations:\n\n";


for ($i=0 ; $i < 10 ; ++$i) {

$mutant = mutate($mutant);

print "$mutant\n";

}
exit;

################################################################################

# Subroutines for Example 7-2

################################################################################
# Notice, now that we have a fair number of subroutines, we

# list them alphabetically

# A subroutine to perform a mutation in a string of DNA

#

# WARNING: make sure you call srand to seed the


# random number generator before you call this function.
sub mutate {
my($dna) = @_;
my(@nucleotides) = ('A', 'C', 'G', 'T');
# Pick a random position in the DNA

my($position) = randomposition($dna);


# Pick a random nucleotide

my($newbase) = randomnucleotide(@nucleotides);


# Insert the random nucleotide into the random position in the DNA

# The substr arguments mean the following:

# In the string $dna at position $position change 1 character to

# the string in $newbase

substr($dna,$position,1,$newbase);
return $dna;

}
# A subroutine to randomly select an element from an array

#

# WARNING: make sure you call srand to seed the



# random number generator before you call this function.
sub randomelement {
my(@array) = @_;
return $array[rand @array];

}
# randomnucleotide

#

# A subroutine to select at random one of the four nucleotides



#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.

# The elements of the array are numbered 0 to size-1

return randomelement(@nucleotides);

}
# randomposition

#

# A subroutine to randomly select a position in a string.



#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomposition {
my($string) = @_;

# Notice the "nested" arguments:

#

# $string is the argument to length


# length($string) is the argument to rand

# rand(length($string))) is the argument to int

# int(rand(length($string))) is the argument to return

# But we write it without parentheses, as permitted.

#

# rand returns a decimal number between 0 and its argument.



# int returns the integer portion of a decimal number.

#

# The whole expression returns a random number between 0 and length-1,



# which is how the positions in a string are numbered in Perl.

#
return int rand length $string;

}

Here's some typical output from Example 7-2:


Mutate DNA
Here is the original DNA:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Here is the mutant DNA:
AAAAAAAAAAAAAAAAAAAAGAAAAAAAAA
Here are 10 more successive mutations:
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA

AAAAAAAAAAAAAAAAAAAAGACAAAAAAA

AAAAAAAAAAAAAAAAAAAAGACAAAAAAA

AAAAAAAAAAAAAACAAAAAGACAAAAAAA

AAAAAAAAAAAAAACAACAAGACAAAAAAA

AAAAAAAAAAAAAACAACAAGACAAAAAAA

AAAAAAAAAGAAAACAACAAGACAAAAAAA

AAAAAATAAGAAAACAACAAGACAAAAAAA

AAAAAATAAGAAAACAACAAGACAAAAAAA

AAAAAATTAGAAAACAACAAGACAAAAAAA

Example 7-2 was something of a programming challenge, but you end up with the satisfaction of seeing your (simulated) DNA mutate. How about writing a graphical display for this, so that every time a base gets mutated, it makes a little explosion and the color gets highlighted, so you can watch it happening in real-time?

Before you scoff, you should know how important good graphical displays are for the success of most programs. This may be a trivial-sounding graphic, but if you can demonstrate the most common mutations in, for instance, the BRCA breast cancer genes in this way, it might be useful.

7.3.4 A Bug in Your Program?

To return to the business at hand, you may have noticed something when you looked over the output from Example 7-2. Look at the first two lines of the "10 more successive mutations." They are exactly the same! Could it be that after patting yourself on the back and telling yourself what a good bit of work you'd done, you've discovered a bug?


How can you track it down? You may want to step through the running of the program with the Perl debugger, which you saw in Chapter 6. However, this time, you stop and think about your design instead. You're replacing the bases at random positions with randomly chosen bases. Aha! Sometimes the base at the position you randomly choose is exactly the same as the base you randomly choose to plug into its place! You're replacing a base with itself on occasion![3]
[3] How often? In DNA that's all one base, it's happening 1/4 of the time. In DNA that's equally populated with the four bases, it's happening...1/4 of the time!
Let's say you decide that behavior is not useful. At each successive mutation, you need to see one base change. How can you alter your code to ensure that? Let's start with some pseudocode for the mutate subroutine:
Select a random position in the string of DNA
Repeat:
Choose a random nucleotide
Until: random nucleotide differs from the nucleotide in the random position
Substitute the random nucleotide into the random position in the DNA

This seems like something that should work, so you alter the mutate subroutine, calling it the mutate_better subroutine:


# mutate_better

#

# Subroutine to perform a mutation in a string of DNA--version 2, in which


# it is guaranteed that one base will change on each call

#

# WARNING: make sure you call srand to seed the


# random number generator before you call this function.
sub mutate_better {
my($dna) = @_;

my(@nucleotides) = ('A', 'C', 'G', 'T');


# Pick a random position in the DNA

my($position) = randomposition($dna);


# Pick a random nucleotide

my($newbase);


do {

$newbase = randomnucleotide(@nucleotides);


# Make sure it's different than the nucleotide we're mutating

}until ( $newbase ne substr($dna, $position,1) );


# Insert the random nucleotide into the random position in the DNA

# The substr arguments mean the following:

# In the string $dna at position $position change 1 character to

# the string in $newbase

substr($dna,$position,1,$newbase);
return $dna;

}

When you plug this subroutine in place of mutate and run the code, you get the following output:


Mutate DNA
Here is the original DNA:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Here is the mutant DNA:
AAAAAAAAAAAAATAAAAAAAAAAAAAAAA
Here are 10 more successive mutations:
AAAAAAAAAAAAATAAAAAAAACAAAAAAA

AAAAATAAAAAAATAAAAAAAACAAAAAAA

AAATATAAAAAAATAAAAAAAACAAAAAAA

AAATATAAAAAAATAAAAAAAACAACAAAA

AATTATAAAAAAATAAAAAAAACAACAAAA

AATTATTAAAAAATAAAAAAAACAACAAAA

AATTATTAAAAAATAAAAAAAACAACACAA

AATTATTAAAAAGTAAAAAAAACAACACAA

AATTATTAAAAAGTGAAAAAAACAACACAA

AATTATTAAAAAGTGATAAAAACAACACAA

which seems to indeed make a real change on every iteration.

Notice one more thing about declaring variables. In this code for mutate_better, if you'd declared $newbase within the loop, since the loop is enclosed in a block, the variable $newbase would not then be visible outside of that loop. In particular, it wouldn't be available in the substr call that does the actual base change for the mutation. So, in mutate_better, you had to declare the variable outside of the loop.

This is a frequent source of confusion for programmers who like to declare variables on the fly and a powerful argument for getting into the habit of collecting variable definitions together at the top of the program.
Even so, there are often times when you want to hide a variable within a block, because that's the only place where you will use it. Then you may want to do the declaration in the block . (Perhaps at the top of the block, if it's a long one?)
7.4 Generating Random DNA

It's often useful to generate random data for test purposes. Random DNA can also be used to study the organization of actual DNA from an organism. In this section, we'll write some programs to generate random DNA sequences.


Such random DNA sequences have proved useful in several ways. For instance, the popular BLAST program (see Chapter 12) depends on the properties of random DNA for the analytic and empirical results that underpin the sequence similarity scores, statistics that are used to rank the "hits" that BLAST returns to the user.
Let's assume what's needed is a set of random DNA fragments of varying length. Your program will have to specify a maximum and a minimum length, as well as how many fragments to generate.
7.4.1 Bottom-up Versus Top-down

In Example 7-2, you wrote the basic subroutines, then a subroutine that called the basic subroutines, and finally the main program. If you ignore the pseudocode, this is an example of bottom-up design; start with the building blocks, then assemble them into a larger structure.

Now let's see what it's like to start with the main program, with its subroutine calls, and write the subroutines after you find a need for them. This is called top-down design.

7.4.2 Subroutines for Generating a Set of Random DNA

Given our goal of generating random DNA, perhaps what you want is a data-generating subroutine:


@random_DNA = make_random_DNA_set( $minimum_length, $maximum_length, $size_of_set );

This looks okay, but of course, it begs the question of how to actually accomplish the overall task. (That's top-down design for you!) So you need to move down and write pseudocode for the make_random_DNA_set subroutine:


repeat $size_of_set times:
$length = random number between minimum and maximum length
$dna = make_random_DNA ( $length );
add $dna to @set

}
return @set

Now, continuing the top-down design, you need some pseudocode for the make_random_DNA subroutine:
from 1 to $size
$base = randomnucleotide
$dna .= $base

}
return $dna

Don't go any further: you've already got a randomnucleotide subroutine from Example 7-2.
(Are you bothered by the absence of balanced curly braces in the pseudocode? Here, you're relying on indentation and lining up the right braces to indicate the blocks. Since it's pseudocode, anything is allowed as long as it works.)
7.4.3 Turning the Design into Code

Now that we've got a top-down design, how to proceed with the coding? Let's follow the top-down design, just to see how it works.


Example 7-3 starts with the main program and proceeds, following the order of the top-down design you did in pseudocode, then followed by the subroutines.

Example 7-3. Generate random DNA

#!/usr/bin/perl

# Generate random DNA

# using a random number generator to randomly select bases

use strict;

use warnings;


# Declare and initialize the variables

my $size_of_set = 12;

my $maximum_length = 30;

my $minimum_length = 15;


# An array, initialized to the empty list, to store the DNA in

my @random_DNA = ( );


# Seed the random number generator.

# time|$$ combines the current time with the current process id

srand(time|$$);
# And here's the subroutine call to do the real work

@random_DNA = make_random_DNA_set( $minimum_length, $maximum_length, $size_of_set );


# Print the results, one per line

print "Here is an array of $size_of_set randomly generated DNA sequences\n";

print " with lengths between $minimum_length and $maximum_length:\n\n";
foreach my $dna (@random_DNA) {
print "$dna\n";

}
print "\n";


exit;
################################################################################

# Subroutines

################################################################################
# make_random_DNA_set

#

# Make a set of random DNA



#

# Accept parameters setting the maximum and minimum length of

# each string of DNA, and the number of DNA strings to make

#

# WARNING: make sure you call srand to seed the



# random number generator before you call this function.
sub make_random_DNA_set {
# Collect arguments, declare variables

my($minimum_length, $maximum_length, $size_of_set) = @_;


# length of each DNA fragment

my $length;

# DNA fragment

my $dna;
# set of DNA fragments

my @set;

# Create set of random DNA

for (my $i = 0; $i < $size_of_set ; ++$i) {

# find a random length between min and max

$length = randomlength ($minimum_length, $maximum_length);


# make a random DNA fragment

$dna = make_random_DNA ( $length );


# add $dna fragment to @set

push( @set, $dna );

}
return @set;

}
# Notice that we've just discovered a new subroutine that's

# needed: randomlength, which will return a random

# number between (or including) the min and max values.

# Let's write that first, then do make_random_DNA
# randomlength

#

# A subroutine that will pick a random number from



# $minlength to $maxlength, inclusive.

#

# WARNING: make sure you call srand to seed the



# random number generator before you call this function.
sub randomlength {
# Collect arguments, declare variables

my($minlength, $maxlength) = @_;


# Calculate and return a random number within the

# desired interval.

# Notice how we need to add one to make the endpoints inclusive,

# and how we first subtract, then add back, $minlength to

# get the random number in the correct interval.

return ( int(rand($maxlength - $minlength + 1)) + $minlength );

}
# make_random_DNA

#

# Make a string of random DNA of specified length.



#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub make_random_DNA {
# Collect arguments, declare variables

my($length) = @_;


my $dna;
for (my $i=0 ; $i < $length ; ++$i) {
$dna .= randomnucleotide( );

}
return $dna;

}
# We also need to include the previous subroutine

# randomnucleotide.

# Here it is again for completeness.

# randomnucleotide

#

# Select at random one of the four nucleotides


#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.

# The elements of the array are numbered 0 to size-1

return randomelement(@nucleotides);

}
# randomelement

#

# randomly select an element from an array



#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomelement {
my(@array) = @_;
return $array[rand @array];

}

Here's the output from Example 7-3:


Here is an array of 12 randomly generated DNA sequences

with lengths between 15 and 30:


TACGCTTGTGTTTTCGGGGGAC

GGGGTGTGGTAAGGCTGTCTCAGATGTGC

TGAACGACAACCTCCTGGACTTTACT

ATCTATGCTTTGCCATGCTAGT

CCGCTCATTCCTCTTCCTCGGC

TGTACCCCTAATACACTTTAGCCGAATTTA

ATAGGTCGGGGCGACAGCGCCGG

GATTGACCTCTGTAA

AAAATCTCTAGGATCGAGC

GTATGTGCTTGGGTAAAT

ATGGAGTTGCGAGGAAGTAGCTGAGT

GGCCCATGACCAGCATCCAGACAGCA

7.5 Analyzing DNA

In this final example dealing with randomization, you'll collect some statistics on DNA in order to answer the question: on average, what percentage of bases are the same between two random DNA sequences? Although some simple mathematics can answer the question for you, the point of the program is to show that you now have the necessary programming ability to ask and answer questions about your DNA sequences. (If you were using real DNA, say a collection of some particular gene as it appears in several organisms in slightly different forms, the answer would be somewhat more interesting. You may want to try that later.)

So let's generate a set of random DNA, all the same length, then ask the following question about the set. What's the average percentage of positions that are the same between pairs of DNA sequences in this set?
As usual, let's try to sketch an idea of the program in pseudocode:
Generate a set of random DNA sequences, all the same length
For each pair of DNA sequences
How many positions in the two sequences are identical as a fraction?
}
Report the mean of the preceding calculations as a percentage

Clearly, to write this code, you can reuse at least some of the work you've already done. You certainly know how to generate a set of random DNA sequences. Also, although you don't have a subroutine that compares, position by position, the bases in two sequences, you know how to look at the positions in DNA strings. So that subroutine shouldn't be hard to write. In fact, let's write some pseudocode that compares each nucleotide in one sequence with the nucleotide in the same position in another sequence:


assuming DNA1 is the same length as DNA2,
for each position from 1 to length(DNA)
if the character at that position is the same in DNA_1 and DNA_2
++$count

}

}


return count/length

The whole problem now seems eminently do-able. You also have to write the code that picks each pair of sequences, collects the results, and finally takes the mean of the results and report it as a percentage. That can all go into the main program. Example 7-4 gives it a try, all in one shot.

Example 7-4. Calculate average % identity between pairs of random DNA sequences

#!/usr/bin/perl

# Calculate the average percentage of positions that are the same

# between two random DNA sequences, in a set of 10 sequences.

use strict;

use warnings;


# Declare and initialize the variables

my $percent;

my @percentages;

my $result;


# An array, initialized to the empty list, to store the DNA in

my @random_DNA = ( );


# Seed the random number generator.

# time|$$ combines the current time with the current process id

srand(time|$$);
# Generate the data set of 10 DNA sequences.

@random_DNA = make_random_DNA_set( 10, 10, 10 );


# Iterate through all pairs of sequences

for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) {

for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) {
# Calculate and save the matching percentage

$percent = matching_percentage($random_DNA[$k], $random_DNA[$i]);

push(@percentages, $percent);

}

}


# Finally, the average result:

$result = 0;


foreach $percent (@percentages) {

$result += $percent;

}
$result = $result / scalar(@percentages);

#Turn result into a true percentage

$result = int ($result * 100);
print "In this run of the experiment, the average percentage of \n";

print "matching positions is $result%\n\n";


exit;
################################################################################

# Subroutines

################################################################################
# matching_percentage

#

# Subroutine to calculate the percentage of identical bases in two



# equal length DNA sequences
sub matching_percentage {
my($string1, $string2) = @_;

# we assume that the strings have the same length

my($length) = length($string1);

my($position);

my($count) = 0;

for ($position=0; $position < $length ; ++$position) {

if(substr($string1,$position,1) eq substr($string2,$position,1)) {

++$count;

}

}


return $count / $length;

}
# make_random_DNA_set

#

# Subroutine to make a set of random DNA



#

# Accept parameters setting the maximum and minimum length of

# each string of DNA, and the number of DNA strings to make

#

# WARNING: make sure you call srand to seed the



# random number generator before you call this function.
sub make_random_DNA_set {
# Collect arguments, declare variables

my($minimum_length, $maximum_length, $size_of_set) = @_;


# length of each DNA fragment

my $length;

# DNA fragment

my $dna;
# set of DNA fragments

my @set;
# Create set of random DNA

for (my $i = 0; $i < $size_of_set ; ++$i) {


# find a random length between min and max

$length = randomlength ($minimum_length, $maximum_length);


# make a random DNA fragment

$dna = make_random_DNA ( $length );


# add $dna fragment to @set

push( @set, $dna );

}
return @set;

}
# randomlength

#

# A subroutine that will pick a random number from



# $minlength to $maxlength, inclusive.

#

# WARNING: make sure you call srand to seed the



# random number generator before you call this function.
sub randomlength {
# Collect arguments, declare variables

my($minlength, $maxlength) = @_;

# Calculate and return a random number within the

# desired interval.

# Notice how we need to add one to make the endpoints inclusive,

# and how we first subtract, then add back, $minlength to

# get the random number in the correct interval.

return ( int(rand($maxlength - $minlength + 1)) + $minlength );

}

# make_random_DNA

#

# Make a string of random DNA of specified length.



#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub make_random_DNA {
# Collect arguments, declare variables

my($length) = @_;


my $dna;
for (my $i=0 ; $i < $length ; ++$i) {

$dna .= randomnucleotide( );

}
return $dna;

}
# randomnucleotide

#

# Select at random one of the four nucleotides



#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomnucleotide {
my(@nucleotides) = ('A', 'C', 'G', 'T');
# scalar returns the size of an array.

# The elements of the array are numbered 0 to size-1

return randomelement(@nucleotides);

}
# randomelement

#

# randomly select an element from an array



#

# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomelement {
my(@array) = @_;
return $array[rand @array];

}

If the code in Example 7-4 seems somewhat repetitive of code from previous examples, it is. In the interest of presentation, I included the subroutine code in the program. (You'll start using modules in Chapter 8 as a way to avoid this repetition.)


Here's the output of Example 7-4:

In this run of the experiment, the average number of

matching positions is 0.24%

Well, that seems reasonable. You might say, it's obvious: a quarter of the positions match, and there are four bases. But the point isn't to verify elementary probability, it's to show you have enough programming under your belt to write some programs that ask and answer questions about DNA sequences.



1   ...   10   11   12   13   14   15   16   17   ...   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page