Beginning Perl for Bioinformatics



Download 1.4 Mb.
Page9/28
Date conversion29.03.2017
Size1.4 Mb.
1   ...   5   6   7   8   9   10   11   12   ...   28

5.3.3 do-until Loops

There's a new kind of loop in Example 5-3, the do-until loop, which first executes a block and then does a conditional test. Sometimes this is more convenient than the usual order in which you test first, then do the block if the test succeeds. Here, you want to prompt the user, get the user's input, search for the motif, and report the results. Before doing it again, you check the conditional test to see if the user has input an empty line. This means that the user has no more motifs to look for, so you exit the loop.



5.3.4 Regular Expressions

Regular expressions let you easily manipulate strings of all sorts, such as DNA and protein sequence data. What's great about regular expressions is that if there's something you want to do with a string, you usually can do it with Perl regular expressions.

Some regular expressions are very simple. For instance, you can just use the exact text of what you're searching for as a regular expression: if I was looking for the word "bioinformatics" in the text of this book, I could use the regular expression:

/bioinformatics/

Some regular expressions can be more complex, however. In this section, I'll explain their use in Example 5-3.


5.3.4.1 Regular expressions and character classes

Regular expressions are ways of matching one or more strings using special wildcard-like operators. Regular expressions can be as simple as a word, which matches the word itself, or they can be complex and made to match a large set of different words (or even every word!).

After you join the protein sequence data into the scalar variable $protein in Example 5-3, you also need to remove newlines and anything else that's not sequence data. This can include numbers on the lines, comments, informational or "header" lines, and so on. In this case, you want to remove newlines and any spaces or tabs that might be invisibly present. The following line of code in Example 5-3 removes this whitespace:

$protein =~ s/\s//g;

The sequence data in the scalar variable $protein is altered by this statement. You first saw the binding operator =~ and the substitute function s/// back in Example 4-3, where they were used to change one character into another. Here, they're used a little differently. You substitute any one of a set of whitespace characters, represented by \s with nothing and by the lack of anything between the second and third forward slashes. In other words, you delete any of a set of whitespace characters, which is done globally throughout the string by virtue of the g at the end of the statement.

The \s is one of several metasymbols. You've already seen the metasymbol \n. The \s metasymbol matches any space, tab, newline, carriage return, or formfeed. \s can also be written as:

[ \t\n\f\r]

This expression is an example of a character class and is enclosed in square brackets. A character class matches one character, any one of the characters named within the square brackets. A space is just typed as a space; other whitespace characters have their own metasymbols: \t for tab, \n for newline, \f for formfeed, and \r for carriage return. A carriage return causes the next character to be written at the beginning of the line, and a formfeed advances to the next line. The two of them together amount to the same thing as a newline character.

Each s/// command I've detailed has some kind of regular expression between the first two forward slashes /. You've seen single letters as the C in s/C/G/g in that position. The C is an example of a valid regular expression.

There's another use of regular expressions in Example 5-3. The line of code:

if ( $motif =~ /^\s*$/ ) {

is, in English, testing for a blank line in the variable $motif. If the user input is nothing except for perhaps some whitespace, represented as \s*, the match succeeds, and the program exits. The whole regular expression is:

/^\s*$/

which translates as: match a string that, from the beginning (indicated by the ^), is zero or more (indicated by the *) whitespace characters (indicated by the \s) until the end of the string (indicated by the $).

If this seems somewhat cryptic, just hang in there and you'll soon get familiar with the terminology. Regular expressions are a great way to manipulate sequence and other text-based data, and Perl is particularly good at making regular expressions relatively easy to use, powerful, and flexible. Many of the references in Appendix A contain material on regular expressions, and there's a concise summary in Appendix B.


5.3.4.2 Pattern matching with =~ and regular expressions

The actual search for the motif happens in this line from Example 5-3:

if ( $protein =~ /$motif/ ) {

Here, the binding operator =~ searches for the regular expression stored as the value of the variable $motif in the protein $protein. Using this feature, you can interpolate the value of a variable into a string match. (Interpolation in Perl strings means inserting the value of a variable into a string, as you first saw in Example 4-2 when you were concatenating strings). The actual motif, that is, the value of the string variable $motif, is your regular expression. The simplest regular expressions are just strings of characters, such as the motif AQQK, for example.

You can use Example 5-3 to play with some more features of regular expressions. You can type in any regular expression to search for in the protein. Try starting up the program, referring to the documentation on regular expressions, and play! Here are some examples of typing in regular expressions:


  • Search for an A followed by a D or S, followed by a V:

Enter a motif to search for: A[DS]V

I couldn't find it.



  • Search for K, N, zero or more D's, and two or more E's (note that {2,} means "two or more"):

Enter a motif to search for: KND*E{2,}

I found it!



  • Search for two E's, followed by anything, followed by another two E's:

Enter a motif to search for: EE.*EE

I found it!

In that last search, notice that a period stands for any character except a newline, and ".*" stands for zero or more such characters. (If you want to actually match a period, you have to escape it with a backslash.)


5.4 Counting Nucleotides

There are many things you might want to know about a piece of DNA. Is it coding or noncoding?[3] Does it contain a regulatory element? Is it related to some other known DNA, and if so, how? How many of each of the four nucleotides does the DNA contain? In fact, in some species the coding regions have a specific nucleotide bias, so this last question can be important in finding the genes. Also, different species have different patterns of nucleotide usage. So counting nucleotides can be interesting and useful.



[3] Coding DNA is DNA that codes for a protein, that is, it is part of a gene. In many organisms, including humans, a large part of the DNA is noncoding—not part of genes and doesn't code for proteins. In humans, about 98-99% of DNA is noncoding.

In the following sections are two programs, Examples 5-4 and 5-6, that make a count of each type of nucleotide in some DNA. They introduce a few new parts of Perl:



  • "Exploding" a string

  • Looking at specific locations in strings

  • Iterating over an array

  • Iterating over the length of a string

To get the count of each type of nucleotide in some DNA, you have to look at each base, see what it is, and then keep four counts, one for each nucleotide. We'll do this in two ways:

  • Explode the DNA into an array of single bases, and iterate over the array (that is, deal with the elements of the array one by one)

  • Use the substr Perl function to iterate over the positions in the string of DNA while counting

First, let's start with some pseudocode of the task. Afterwards, we'll make more detailed pseudocode, and finally write the Perl program for both approaches.

The following pseudocode describes generally what is needed:

for each base in the DNA

if base is A

count_of_A = count_of_A + 1

if base is C

count_of_C = count_of_C + 1

if base is G

count_of_G = count_of_G + 1

if base is T

count_of_T = count_of_T + 1

done
print count_of_A, count_of_C, count_of_G, count_of_T

As you can see, this is a pretty simple idea, mirroring what you'd do by hand if you had to. (If you want to count the relative frequencies of the bases in all human genes, you can't do it by hand—there are too many of them—and you have to use such a program. Thus bioinformatics.) Now let's see how it can be coded in Perl.


5.5 Exploding Strings into Arrays

Let's say you decide to explode the string of DNA into an array. By explode I mean separating out each letter in the string—sort of like blowing the string into bits. In other words, the letters representing the bases of the DNA in the string are separated, and each letter becomes its own scalar value in an array. Then you can look at the array elements (each of which is a single character) one by one, making the count as you go along. This is the inverse of the join function in Section 5.3.2, which takes an array of strings and makes a single scalar value out of them. (After exploding a string into an array, you could then join the array back into an identical string using join, if you so desire.)

I'm also adding to this version of the pseudocode the instructions to get the DNA from a file and manipulate that file data until it's a single string of DNA sequence. So first, you join the data from the array of lines of the original file data, clean it up by removing whitespace until only sequence is left, and then explode it back into an array. But, of course, the point is that the last array has exactly what is needed, the data in a convenient form to use in the counting loop. Instead of an array of lines, with newlines and possibly other unwanted characters, there's an exact array of the individual bases.

read in the DNA from a file


join the lines of the file into a single string $DNA
# make an array out of the bases of $DNA

@DNA = explode $DNA


# initialize the counts

count_of_A = 0

count_of_C = 0

count_of_G = 0

count_of_T = 0

for each base in @DNA

if base is A

count_of_A = count_of_A + 1

if base is C

count_of_C = count_of_C + 1

if base is G

count_of_G = count_of_G + 1

if base is T

count_of_T = count_of_T + 1

done
print count_of_A, count_of_C, count_of_G, count_of_T

As promised, this version of the pseudocode is a bit more detailed. It suggests a method to look at each of the bases by exploding the string of DNA into an array of single characters. It also initializes the counts to zero to ensure they start off right. It's easier to see what's happening if you spell out the initialization in the program, and it can prevent certain kinds of errors from creeping into your code. (It's not a rule, however; sometimes, you may prefer to leave the values of variables undefined until they are used.) Perl assumes that an uninitialized variable has the value 0 if you try to use it as a number, for instance by adding another number to it. But you'll most likely get a warning if that is the case.

We now have a design for the program, let's turn it into Perl code. Example 5-4 is a workable program; you'll see other ways to accomplish the same task more quickly as you proceed in this chapter, but speed is not the main concern at this point.

Example 5-4. Determining frequency of nucleotides

#!/usr/bin/perl -w

# Determining frequency of nucleotides
# Get the name of the file with the DNA sequence data

print "Please type the filename of the DNA sequence data: ";


$dna_filename = ;
# Remove the newline from the DNA filename

chomp $dna_filename;

# open the file, or exit

unless ( open(DNAFILE, $dna_filename) ) {

print "Cannot open file \"$dna_filename\"\n\n";

exit;


}
# Read the DNA sequence data from the file, and store it

# into the array variable @DNA

@DNA = ;
# Close the file

close DNAFILE;


# From the lines of the DNA file,

# put the DNA sequence data into a single string.

$DNA = join( '', @DNA);
# Remove whitespace

$DNA =~ s/\s//g;


# Now explode the DNA into an array where each letter of the

# original string is now an element in the array.

# This will make it easy to look at each position.

# Notice that we're reusing the variable @DNA for this purpose.

@DNA = split( '', $DNA );
# Initialize the counts.

# Notice that we can use scalar variables to hold numbers.

$count_of_A = 0;

$count_of_C = 0;

$count_of_G = 0;

$count_of_T = 0;

$errors = 0;
# In a loop, look at each base in turn, determine which of the

# four types of nucleotides it is, and increment the

# appropriate count.

foreach $base (@DNA) {


if ( $base eq 'A' ) {

++$count_of_A;

} elsif ( $base eq 'C' ) {

++$count_of_C;

} elsif ( $base eq 'G' ) {

++$count_of_G;

} elsif ( $base eq 'T' ) {

++$count_of_T;

} else {

print "!!!!!!!! Error - I don\'t recognize this base: $base\n";

++$errors;

}

}


# print the results

print "A = $count_of_A\n";

print "C = $count_of_C\n";

print "G = $count_of_G\n";

print "T = $count_of_T\n";

print "errors = $errors\n";


# exit the program

exit;

To demonstrate Example 5-4, I have created the following small file of DNA and called it small.dna:

AAAAAAAAAAAAAAGGGGGGGTTTTCCCCCCCC

CCCCCGTCGTAGTAAAGTATGCAGTAGCVG

CCCCCCCCCCGGGGGGGGAAAAAAAAAAAAAAATTTTTTAT

AAACG

The file small.dna can be typed into your computer using your favorite text editor, or you can download it from this book's web site.


Notice that there is a V in the file, an error.[4] Here is the output of Example 5-4:

[4] Files of DNA sequence data sometimes include such characters as N, meaning "some undetermined base," or other special characters. You sometimes have to look at the documentation for the source, say an ABI sequencer or a GenBank file or whatever, to discover which characters are used and what they mean.

Please type the filename of the DNA sequence data: small.dna

!!!!!!!! Error - I don't recognize this base: V
A = 40

C = 27


G = 24

T = 17


Now let's look at the new stuff in this program. Opening and reading the sequence data is the same as previous programs. The first new thing is at this line:

@DNA = split( '', $DNA);

which the comments say will explode the string $DNA into an array of single characters @DNA.

split
is the companion to join, and it's a good idea to take a little while to look over the documentation for these two commands. Calling split with an empty string as the first argument causes the string to explode into individual characters; that's just what we want.[5]

[5] As you'll see in the documentation for the split function, the first argument can be any regular expression, such as /\s+/ (one or more adjacent whitespace characters.)

Next, there are five scalar variables initialized to 0, the variables $count_of_A and so forth. I nitializing means assigning an initial value, in this case, the value 0.


Example 5-4 illustrates the concepts of type
and initialization. The type of a variable determines what kind of data it can hold, for instance, strings or numbers. Up to now we've been using scalar variables such as $DNA to store strings of letters such as A, C, G, and T. Example 5-4 shows that you can also use scalar variables to store numbers. For example, the variable $count_of_A keeps a running count of the character A.

Scalar variables can store integers (0, 1, -1, 2, -2, ...), decimal or floating-point numbers such as 6.544, and numbers in scientific notation such as 6.544E6, which translates as 6.544 x 106, or 6,544000. (See Appendix B for more details on types of numbers.)

In Example 5-4, the variables $count_of_A through $count_of_T are initialized to 0. Initializing a variable means giving it a value after it's declared. If you don't initialize your variables, they assume the value of 'undef'. In Perl, an undefined variable is 0 if it is asked for in numerical context; it's an empty string if used in a string operation. Although Perl programmers often choose not to initialize variables, it's a critical step in many other languages. In C for instance, uninitialized variables have unpredictable values. This can wreak havoc with your output. You should get in the habit of initializing variables; it makes the program easier to read and maintain, and that's important.

To declare a variable means to specify its name and other attributes such as an initial value and a scope (for scoping, see Chapter 6 and the discussion of my variables). Many languages require you to declare all variables before using them. For this book, up to now, declarations have been an unnecessary complication. The next chapter begins to require declarations. In Perl, you may declare a variable's scope (see Chapter 6 and the discussion of my variables) in addition to an initial value. Many languages also require you to declare the type of a variable, for example "integer," or "string," but Perl does not.

Perl is written to be smart about what's in a scalar variable. For instance, you can assign the number 1234 (without quotes) to a variable, or you can assign the string '1234' (with quotes). Perl treats the variable as a string for printing, and as a number for using in arithmetic operations, without your having to worry about it. Example 5-5 demonstrates this ability. In other words, Perl isn't strict about specifying the type of data a variable is used for.


Example 5-5. Demonstration of Perl's built-in knowledge about numbers and strings

#!/usr/bin/perl -w

# Demonstration of Perl's built-in knowledge about numbers and strings
$num = 1234;
$str = '1234';
# print the variables

print $num, " ", $str, "\n";


# add the variables as numbers

$num_or_str = $num + $str;


print $num_or_str, "\n";
# concatenate the variables as strings

$num_or_str = $num . $str;


print $num_or_str, "\n";
exit;

Example 5-5 produces the output:

1234 1234

2468

12341234

Example 5-5 illustrates the smart way Perl determines the datatype of a scalar variable, whether it's a string or a number, and whether you're trying to add or subtract it like a number or concatenate it like a string. Perl behaves accordingly, which makes your job as a programmer a little bit easier; Perl "does the right thing" for you.

Next is a new kind of loop, the foreach loop. This loop works over the elements of an array. The line:

foreach $base (@DNA) {

loops over the elements of the array @DNA, and each time through the loop, the scalar variable $base (or whatever name you choose) is set to the next element of the array.

The body of the loop checks for each base and increments the count for that base if found. There are four ways to add 1 to a number in Perl. Here, you put a ++ in front of the variable, like this:

++$count;

You can also put the ++ after the variable:

$count++;

You can spell it out like this, a combination of adding and assignment:

$count = $count + 1;

or, as a shorthand of that, you can say:

$count += 1;

Almost an embarrassment of riches. The plus-plus (++) notation is convenient for incrementing counts, as we're doing here. The plus-equals (+=) notation saves some typing and is very popular for adding other numbers besides 1.

The foreach loop in Example 5-5 could have been written like this:

foreach (@DNA) {

if ( /A/ ) {

++$count_of_A;

} elsif ( /C/ ) {

++$count_of_C;

} elsif ( /G/ ) {

++$count_of_G;

} elsif ( /T/ ) {

++$count_of_T;

} else {

print "!!!!!!!! Error - I don\'t recognize this base: ";

print;

print "\n";



++$errors;

}

}



This version of the foreach loop:

foreach(@DNA) {.

doesn't have a scalar value. In a foreach loop, if you don't specify a scalar variable to hold the scalars that are being read from the array ($base served that function in the version of this loop in Example 5-5), Perl uses the special variable $_ .

Furthermore, many Perl built-in functions operate on this special variable if no argument is provided to them. Here, the conditional tests are simply patterns; Perl assumes you're doing a pattern match on the $_ variable, so it behaves as if you had said $_ =~ /A/, for instance. Finally, in the error message, the statement print; prints the value of the $_ variable.

This special variable $_ that doesn't have to be named appears in many Perl programs, although I don't use it extensively in this book.

5.6 Operating on Strings

It's not necessary to explode a string into an array in order to look at each character. In fact, sometimes you'd want to avoid that. A large string takes up a large amount of memory in your computer. So does a large array. When you explode a string into an array, the original string is still there, and you also have to make a copy of each character for the elements of the new array you're creating. If you have a large string, that already uses a good portion of available memory, creating an additional array can cause you to run out of memory. When you run out of memory, your computer performs poorly; it can slow to a crawl, crash, or freeze ("hang"). These haven't been worrisome considerations up to now, but if you use large data sets (such as the human genome), you have to take these things into account.

So let's say you'd like to avoid making a copy of the DNA sequence data into another variable. Is there a way to just look at the string $DNA and count the bases from it directly? Yes. Here's some pseudocode, followed by a Perl program:

read in the DNA from a file

join the lines of the file into a single string of $DNA
# initialize the counts

count_of_A = 0

count_of_C = 0

count_of_G = 0

count_of_T = 0
for each base at each position in $DNA
if base is A

count_of_A = count_of_A + 1

if base is C

count_of_C = count_of_C + 1

if base is G

count_of_G = count_of_G + 1

if base is T

count_of_T = count_of_T + 1

done
print count_of_A, count_of_C, count_of_G, count_of_T

Example 5-6 shows a program that examines each base in a string of DNA.

Example 5-6. Determining frequency of nucleotides, take 2

#!/usr/bin/perl -w

# Determining frequency of nucleotides, take 2
# Get the DNA sequence data

print "Please type the filename of the DNA sequence data: ";


$dna_filename = ;
chomp $dna_filename;
# Does the file exist?

unless ( -e $dna_filename) {


print "File \"$dna_filename\" doesn\'t seem to exist!!\n";

exit;


}
# Can we open the file?

unless ( open(DNAFILE, $dna_filename) ) {


print "Cannot open file \"$dna_filename\"\n\n";

exit;


}
@DNA = ;
close DNAFILE;
$DNA = join( '', @DNA);
# Remove whitespace

$DNA =~ s/\s//g;

# Initialize the counts.

# Notice that we can use scalar variables to hold numbers.

$count_of_A = 0;

$count_of_C = 0;

$count_of_G = 0;

$count_of_T = 0;

$errors = 0;

# In a loop, look at each base in turn, determine which of the

# four types of nucleotides it is, and increment the

# appropriate count.

for ( $position = 0 ; $position < length $DNA ; ++$position ) {


$base = substr($DNA, $position, 1);
if ( $base eq 'A' ) {

++$count_of_A;

} elsif ( $base eq 'C' ) {

++$count_of_C;

} elsif ( $base eq 'G' ) {

++$count_of_G;

} elsif ( $base eq 'T' ) {

++$count_of_T;

} else {

print "!!!!!!!! Error - I don\'t recognize this base: $base\n";

++$errors;

}

}


# print the results

print "A = $count_of_A\n";

print "C = $count_of_C\n";

print "G = $count_of_G\n";

print "T = $count_of_T\n";

print "errors = $errors\n";


# exit the program

exit;


Here's the output of Example 5-6:

Please type the filename of the DNA sequence data: small.dna

!!!!!!!! Error - I don't recognize this vase: V

A = 40


C = 27

G = 24

T = 17

errors = 1

In Example 5-6, I added a line of code to see if the file exists:

unless ( -e $dna_filename) {

There are file test operators for several conditions; see Appendix B or Perl documentation under -X. Note that files have several attributes, such as size, permission, location in the filesystem, and type of file, and that many of these things can be tested for easily with the file test operators.

Notice, also, that I have kept the detailed comments about the regular expression, because regular expressions can be hard to read, and a little commenting here helps a reader to skim the code.

Everything else is familiar, until you hit the for loop; it requires a little explanation:

for ( $position = 0 ; $position < length $DNA ; ++$position ) {

# the statements in the block

}

This for loop is the equivalent of this while loop:



$position = 0;
while( $position < length $DNA ) {
# the same statements in the block, plus ...
++$position;

}

Take a moment and compare these two loops. You'll see the same statements but in different locations.


As you can see, the for loop brings the initialization and increment of a counter ($position) into the loop statement, whereas in the while loop, they are separate statements. In the for loop, both the initialization and the increment statement are placed between parentheses, whereas you find only the conditional test in the while
loop. In the for loop, you can put initializations before the first semicolon and increment statements after the second semicolon. The initialization statement is done just once before starting the loop, and the increment statement is done at the end of each iteration through the block before going back to the conditional test. It's really just a shorthand for the equivalent while loop as just shown.

The conditional test checks to see if the position reached in the string is less than the length of the string. It uses the length Perl function. Obviously, you don't want to check characters beyond the length of the string. But a word is in order here about the numbering of positions in strings and arrays.

By default, Perl assumes that a string begins at position 0 and its last character is at a position that's numbered one less than the length of the string. Why do it this way instead of numbering the positions from 1 up to and including the length of the string? There are reasons, but they're somewhat abstruse; see the documentation for enlightenment. If it's any comfort, many other programming languages make the same choice. (However, many do it the intuitive way, starting at 1. Ah well.)

This way of numbering is important to biologists because they are used to numbering sequences beginning with 1, not with 0 the way Perl does it. You sometimes have to add 1 to a position before printing out results so they'll make sense to nonprogrammers. It's mildly annoying, but you'll get used to it.

The same holds true for numbering the elements of an array. The first element of an array is element 0; the last is element $length-1.

Anyway, you see that the conditional test evaluates to true while the value of $position is length-1 or less and fails when $position reaches the same value as the length of the string. For example, say you have a string that contains the text "seeing." This has a length of six characters. The "s" is at position 0, and the "g" is at position 5, which is one less than the string length 6.

Back in the block, you call the substr function to look into the string:

$base = substr($DNA, $position, 1);

This is a fairly general-purpose function for working with strings; you can also insert and delete things. Here, you look at just one character, so you call substr on the string $DNA, ask it to look in position $position for one character, and save the result in scalar variable $base. Then you proceed to accumulate the count as in the preceding version of the program, Example 5-4.



1   ...   5   6   7   8   9   10   11   12   ...   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page