Beginning Perl for Bioinformatics

Chapter 7. Mutations and Randomization


Download 1.4 Mb.
Date conversion29.03.2017
Size1.4 Mb.
1   ...   9   10   11   12   13   14   15   16   ...   28

Chapter 7. Mutations and Randomization

As every biologist knows, mutation is a fundamental topic in biology. Mutations in DNA occur all the time in cells. Most of them don't affect the actions of proteins and are benign. Some of them do affect the proteins and may result in diseases such as cancer. Mutations can also lead to nonviable offspring that dies during development; occasionally they can lead to evolutionary change. Many cells have very complex mechanisms to repair mutations.

Mutations in DNA can arise from radiation, chemical agents, replication errors, and other causes. We're going to model mutations as random events, using Perl's random number generator.
Randomization is a computer technique that crops up regularly in everyday programs, most commonly in cryptography, such as when you want to generate a hard-to-guess password. But it's also an important branch of algorithms: many of the fastest algorithms employ randomization.
Using randomization, it's possible to simulate and investigate the mechanisms of mutations in DNA and their effect upon the biological activity of their associated proteins. Simulation is a powerful tool for studying systems and predicting what they will do; randomization allows you to better simulate the "ordered chaos" of a biological system. The ability to simulate mutations with computer programs can aid in the study of evolution, disease, and basic cellular processes such as division and DNA repair mechanisms. Computer models of cell development and function, now in their early stages, will become much more accurate and useful in coming years, and mutation is a basic biological mechanism these models will incorporate.

From the standpoint of programming technique, as well as from the standpoint of modeling evolution, mutation, and disease, randomization is a powerful—and, luckily for us, easy-to-use—programming skill.

Here's a breakdown of what we will accomplish in this chapter:
Randomly select an index into an array and a position in a string: these are the basic tools for picking random locations in DNA (or other data)
Model mutation with random numbers by learning how to randomly select a nucleotide in DNA and then mutate it to some other (random) nucleotide
Use random numbers to generate DNA sequence data sets, which can be used to study the extent of randomness in actual genomes
Repeatedly mutate DNA to study the effect of mutations accumulating over time during evolution
7.1 Random Number Generators

A random number generator is a subroutine you can call. For most practical purposes, you needn't worry about what's inside it. The values you get for random numbers on the computer differ somewhat from the values of real-world random events as measured, for example, by detecting nuclear decay events. Some computers actually have devices such as geiger counters attached so as to have a source of truly random events. But I'd be willing to bet your computer doesn't. What you have in place of a geiger counter, is an algorithm called a random number generator.

The numbers that are output by random number generators are not really random; they are thus called pseudo-random numbers. A random number generator, being an algorithm, is predictable. A random number generator needs a seed, an input you can change to get a different series of (pseudo-)random numbers.
The numbers from a random number generator give an even distribution of values. This is one of the most important characteristics of randomness and largely justifies the use of these algorithms where some amount of random behavior is desired.

The other "take-home message" about random number generators is that the seed you start them up with should itself be selected randomly. If you seed with the same number every time, you'll get the same sequence of "random numbers" every time as well. (Not very random!) Try to pick a seed that has some randomness in it, such as a number calculated from some computer event that changes haphazardly over time.[1]

[1] Even here, for critical applications, you're not out of the woods. Unless you pick your seeds carefully, hackers will figure out how you're picking them and crack your random numbers and therefore your passwords. The method used to generate seeds in this chapter, time|$$, is crackable by dedicated hackers. A better choice is time() ^ ($$+<<15)). If program security is important, you should consult the Perl documentation, and the Math::Random and Math::TrulyRandom modules from CPAN
In the examples that follow, I use a simple method for seed picking that's okay for most purposes. If you use random numbers for data encryption with critical privacy issues (such as patient records), you should read further into the Perl documentation about the several advanced options Perl provides for random number generation. In this book, I use a Perl method that is good enough for most purposes.
7.2 A Program Using Randomization

Example 7-1 introduces randomization in the context of a simple program. It randomly combines parts of sentences to construct a story. This isn't a bioinformatics program, but I've found that it's an effective way to learn the basics of randomization. You will learn how to randomly select elements from arrays, which you'll apply in the future examples that mutate DNA.

The example declares a few arrays filled with parts of sentences, then randomizes their assembly into complete sentences. It's a trivial children's game; yet it teaches several programming points.

Example 7-1. Children's game with random numbers


# Children's game, demonstrating primitive artificial intelligence,

# using a random number generator to randomly select parts of sentences.

use strict;

use warnings;

# Declare the variables

my $count;

my $input;

my $number;

my $sentence;

my $story;

# Here are the arrays of parts of sentences:

my @nouns = (







'Robin Hood',

'Joe and Moe',

my @verbs = (

'ran to',

'giggled with',

'put hot sauce into the orange juice of',



'sang stupid songs with',

'jumped with',

my @prepositions = (

'at the store',

'over the rainbow',

'just for the fun of it',

'at the beach',

'before dinner',

'in New York City',

'in a dream',

'around the world',

# Seed the random number generator.

# time|$$ combines the current time with the current process id

# in a somewhat weak attempt to come up with a random seed.

# This do-until loop composes six-sentence "stories".

# until the user types "quit".

do {

# (Re)set $story to the empty string each time through the loop

$story = '';
# Make 6 sentences per story.

for ($count = 0; $count < 6; $count++) {

# Notes on the following statements:

# 1) scalar @array gives the number of elements in the array.

# 2) rand returns a random number greater than 0 and

# less than scalar(@array).

# 3) int removes the fractional part of a number.

# 4) . joins two strings together.

$sentence = $nouns[int(rand(scalar @nouns))]

. " "

. $verbs[int(rand(scalar @verbs))]

. " "

. $nouns[int(rand(scalar @nouns))]

. " "

. $prepositions[int(rand(scalar @prepositions))]

. '. ';
$story .= $sentence;

# Print the story.

print "\n",$story,"\n";

# Get user input.

print "\nType \"quit\" to quit, or press Enter to continue: ";

$input = ;
# Exit loop at user's request

} until($input =~ /^\s*q/i);


Here is some typical output from Example 7-1:

Joe and Moe jumped with Rebecca in New York City. Rebecca exploded Groucho

in a dream. Mom ran to Harpo over the rainbow. TV giggled with Joe and Moe

over the rainbow. Harpo exploded Joe and Moe at the beach. Robin Hood giggled

with Harpo at the beach.

Type "quit" to quit, or press Enter to continue:
Harpo put hot sauce into the orange juice of TV before dinner. Dad ran to

Groucho in a dream. Joe and Moe put hot sauce into the orange juice of TV

in New York City. Joe and Moe giggled with Joe and Moe over the rainbow. TV

put hot sauce into the orange juice of Mom just for the fun of it. Robin Hood

ran to Robin Hood at the beach.
Type "quit" to quit, or press Enter to continue: quit

The structure of the example is quite simple. After enforcing the declarations of variables, and turning on warnings, with:

use strict;

use warnings;

the variables are declared, and the arrays are initialized with values.

7.2.1 Seeding the Random Number Generator

Next, the random number generator is seeded by a call to the built-in function srand. It takes one argument, the seed for the random number generator discussed earlier. As mentioned, you have to give a different seed at this step to get a different series of random numbers. Try changing this statement to something like:


and then run the program more than once. You'll get the same results each time.[2] The seed you're using:

[2] The latest random number generators automatically change the series, so if this experiment doesn't work, you're probably using a very new random number generator. However, sometimes you want to repeat a series. Note that newer versions of Perl automatically give you a good seed if you call srand like so: srand;.

is a calculation that returns a different seed each time.

time returns a number representing the time, $$ returns a number representing the ID of the Perl program that's running (this typically changes each time you run the program), and | means bitwise OR and combines the bits of the two numbers (for details see the Perl documentation). There are other ways to pick a seed, but let's stick with this popular one.
7.2.2 Control Flow

The main loop of the program is a do-until loop. These loops are handy when you want to do something (like print a little story) before taking any actions (like asking the user if he wants to continue) each time through the loop. The do-until loop first executes the statements in the block and then performs a test to determine if it should repeat the statements in the block. Note that this is the reverse of the other types of loops you've seen that do the test first and then the block.

Since the $story variable is always being appended to, it needs to be emptied at the top of each loop. It's common to forget that variables that are increased in some way need to be reset at the correct spot, so watch for that in your programming. The clue is increasingly long strings or big numbers.

The for loop contains the main work of the program. As you've seen before, this loop initializes a counter, performs a test, and then increments the counter at the end of the block.

7.2.3 Making a Sentence

In Example 7-1, note that the statement that makes a sentence stretches out over a few lines of code. It's a bit complicated, and it's the real work of the whole program, so there are comments attached to help read it. Notice that the statement has been carefully formatted so that it's neatly laid out over its eight lines. The variable names have been well chosen, so it's clear that you're making a sentence out of a noun, a verb, a noun, and a prepositional phrase.

However, even with all that, there are rather deeply nested expressions within the square brackets that specify the array positions, and it requires a bit of scrutiny to read this code. You will see that you're building a string out of sentence parts separated by spaces and ending with a period and a space. The string is built by several applications of the dot string concatenation operator. These have been placed at the beginning of each line to clarify the overall structure of the statement.
7.2.4 Randomly Selecting an Element of an Array

Let's look closely at one of the sentence part selectors:

$verbs[int(rand(scalar @verbs))]

These kinds of nested braces need to be read and evaluated from the inside out. So the expression that's most deeply surrounded by braces is:

scalar @verbs

You see from the comments before the statement that the built-in function scalar returns the number of elements in an array. The array in question, @verbs, has seven elements, so this expression returns 7.

So now you have:

and the most deeply nested expression is now:


The helpful comments in the code before the statement remind you that this statement returns a (pseudo)random number greater than 0 and less than 7. This number is a floating-point number (decimal number with a fraction). Recall that an array with seven elements will number them from 0 to 6.

So now you have something like this:

and you want to evaluate the expression:


The int function discards the fractional part of a floating-point number and returns just the integer part, in this case 3.

So you've come to the final step:
which gives you the fourth element of the @verbs array, as the comments have been kind enough to remind you.
7.2.5 Formatting

To randomly select a verb, you call a few functions:


Determines the size of the array


Picks a random number in the range determined by the size of the array


Transforms the floating-point number rand returns into the integer value you need for an array element

Several of these function calls are combined in one line using nested braces. Sometimes this produces hard-to-read code, and the gentle reader may be nodding his or her head vigorously at this unflattering characterization of the author's painstaking handiwork. You could try rewriting these lines, using additional temporary variables. For instance, you can say:
$verb_array_size = scalar @verbs;

$random_floating_point = rand ( $verb_array_size );

$random_integer = int $random_floating_point;

$verb = $verbs[$random_integer];

and repeat for the other parts of speech, finally building your sentence with a statement such as:

$sentence = "$subject $verb $object $prepositional_phrase. ";

It's a matter of style. You will make these kinds of choices all the time as you program. The choice of layout in Example 7-1 was based on a tradeoff between a desire to express the overall task clearly (which won) balanced against the difficulty of reading highly nested function calls (which lost). Another reason for this layout choice is that, in the programs that follow, you'll select random elements in arrays with some regularity, so you'll get used to seeing this particular nesting of calls. In fact, perhaps you should make a little subroutine out of this kind of call if you will do the same thing many times?

Readability is the most important thing here, as it is in most code. You have to be able to read and understand code, your own as well as the code of others, and that is usually more important than trying to achieve other laudable goals such as fastest speed, smallest amount of memory used, or shortest program. It's not always important, but usually it's best to write for readability first, then go back and try to goose up the speed (or whatever) if necessary. You can even leave the more readable code in there as comments, so whoever has to read the code can still get a clear idea of the program and how you went about improving the speed (or whatever).
7.2.6 Another Way to Calculate the Random Position

Perl often has several ways to accomplish a task. the following is an alternate way to write this random number selection; it uses the same function calls but without the parentheses:

$verbs[int rand scalar @verbs]

This chaining of functions, each of which takes one argument, is common in Perl. To evaluate the expression, Perl first takes @verbs as an argument to scalar, which returns the size of the array. Then it takes that value as an argument to rand, which returns a floating-point number from 0 to less than the size of the array. It then uses that floating-point number as an argument to int, which returns the greatest integer less than the floating-point number. In other words, it calculates the same number to be used as the subscript for the array @verbs.

Why does Perl allow this? Because such calculations are very frequent, and, in the spirit of "Let the computer do the work," Perl designer Larry Wall decided to save you (and himself) the bother of typing and matching all those parentheses.

Having gone that far, Larry decided it'd be easy to add even more. You can eliminate the scalar and the int function calls and use:

$verbs[rand @verbs]
What's going on here? Since rand already expects a scalar value, it evaluates @verbs in a scalar context, which simply returns the size of the array. Larry cleverly designed array subscripts (which, of course, are always integer values) to automatically take just the integer part of a floating-point value if it was given as a subscript; so, out with the int.
7.3 A Program to Simulate DNA Mutation

Example 7-1 gave you the tools you'll need to mutate DNA. In the following examples, you'll represent DNA, as usual, by a string made out of the alphabet A, C, G, and T. You'll randomly select positions in the string and then use the substr function to alter the DNA.

This time, let's go about things a little differently and first compose some of the useful subroutines you'll need before showing the whole program.
7.3.1 Pseudocode Design

Starting with simple pseudocode, here's a design for a subroutine that mutates a random position in DNA to a random nucleotide:

Select a random position in the string of DNA.
Choose a random nucleotide.
Substitute the random nucleotide into the random position in the DNA.
This seems short and to the point. So you decide to make each of the first two sentences into a subroutine. Select a random position in a string

How can you randomly select a position in a string? Recall that the built-in function length returns the length of a string. Also recall that positions in strings are numbered from 0 to length-1, just like positions in arrays. So you can use the same general idea as in Example 7-1, and make a subroutine:

# randomposition


# A subroutine to randomly select a position in a string.


# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomposition {
my($string) = @_;
# This expression returns a random number between 0 and length-1,

# which is how the positions in a string are numbered in Perl.

return int(rand(length($string)));


randomposition is really a short function, if you don't count the comments. It's just like the idea in Example 7-1 to select a random array element.

Of course, if you were really writing this code, you'd make a little test to see if your subroutine worked:
#!/usr/bin/perl -w

# Test the randomposition subroutine

for (my $i=0 ; $i < 20 ; ++$i ) {

print randomposition($dna), " ";

print "\n";
sub randomposition {

my($string) = @_;

return int rand length $string;


Here's some representative output of the test (your results should vary):

28 26 20 1 29 7 1 27 2 24 8 1 23 7 13 14 2 12 13 27

Notice the new look of the for loop:

for (my $i=0 ; $i < 20 ; ++$i ) {

This shows how you can localize the counter variables (in this case, $i) to the loop by declaring them with my inside the for loop. Choose a random nucleotide

Next, let's write a subroutine that randomly chooses one of the four nucleotides:

# randomnucleotide


# A subroutine to randomly select a nucleotide


# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomnucleotide {
my(@nucs) = @_;
# scalar returns the size of an array.

# The elements of the array are numbered 0 to size-1

return $nucs[rand @nucs];


Again, this subroutine is short and sweet. (Most useful subroutines are; although writing a short subroutine is no guarantee it will be useful. In fact, you'll see in a bit how you can improve this one.)

Let's test this one too:
#!/usr/bin/perl -w

# Test the randomnucleotide subroutine

my @nucleotides = ('A', 'C', 'G', 'T');
for (my $i=0 ; $i < 20 ; ++$i ) {

print randomnucleotide(@nucleotides), " ";

print "\n";
sub randomnucleotide {

my(@nucs) = @_;

return $nucs[rand @nucs];


Here's some typical output (it's random, of course, so there's a high probability your output will differ):

C A A A A T T T T T A C A C T A A G G G Place a random nucleotide into a random position

Now for the third and final subroutine, that actually does the mutation. Here's the code:

# mutate


# A subroutine to perform a mutation in a string of DNA

sub mutate {
my($dna) = @_;

my(@nucleotides) = ('A', 'C', 'G', 'T');

# Pick a random position in the DNA

my($position) = randomposition($dna);

# Pick a random nucleotide

my($newbase) = randomnucleotide(@nucleotides);

# Insert the random nucleotide into the random position in the DNA.

# The substr arguments mean the following:

# In the string $dna at position $position change 1 character to

# the string in $newbase


return $dna;


Here, again, is a short program. As you look it over, notice that it's relatively easy to read and understand. You mutate by picking a random position then selecting a nucleotide at random and substituting that nucleotide at that position in the string. (If you've forgotten how substr works, refer to Appendix B or other Perl documentation. If you're like me, you probably have to do that a lot, especially to get the order of the arguments right.)

There's a slightly different style used here for declaring variables. Whereas you've been declaring them at the beginning of a program, here you're declaring each variable the first time it's used. There are pros and cons for each programming style. Having all the variables at the top of the program gives good organization and can help in reading; declaring them on-the-fly can seem like a more natural way to write. The choice is yours.
Also, notice how this subroutine is mostly built from other subroutines, with a little bit added. That has a lot to do with its readability. At this point, you may be thinking that you've actually decomposed the problem pretty well, and the pieces are fairly easy to build and, in the end, they fit together well. But do they?
7.3.2 Improving the Design

You're about to pat yourself on the back for writing the program so quickly, but you notice something. You keep having to declare that pesky @nucleotides array and then pass it in to the randomnucleotide subroutine. But the only place you use the array is inside the randomnucleotide subroutine. So why not change your design a little? Here's a new try:

# randomnucleotide


# A subroutine to randomly select a nucleotide


# WARNING: make sure you call srand to seed the

# random number generator before you call this function.

sub randomnucleotide {

my(@nucs) = ('A', 'C', 'G', 'T');

# scalar returns the size of an array.

# The elements of the array are numbered 0 to size-1

return $nucs[rand @nucs];


Notice that this function now has no arguments. It's called like so:

$randomnucleotide = randomnucleotide( );

It's asking for a random element from a very specific set. Of course, you're always thinking, and you say, "It'd be handy to have a subroutine that randomly selects an element from any array. I might not need it right now, but I bet I'll need it soon!" So you define two subroutines instead of one:

# randomnucleotide


# A subroutine to randomly select a nucleotide


# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomnucleotide {

my(@nucleotides) = ('A', 'C', 'G', 'T');

# scalar returns the size of an array.

# The elements of the array are numbered 0 to size-1

return randomelement(@nucleotides);

# randomelement


# A subroutine to randomly select an element from an array


# WARNING: make sure you call srand to seed the

# random number generator before you call this function.
sub randomelement {
my(@array) = @_;

return $array[rand @array];


Look back and notice that you didn't have to change your subroutine mutate; just the internal workings of randomnucleotide changed, not its behavior.

1   ...   9   10   11   12   13   14   15   16   ...   28

The database is protected by copyright © 2017
send message

    Main page