Beginning Perl for Bioinformatics


Chapter 5. Motifs and Loops



Download 1.4 Mb.
Page8/28
Date conversion29.03.2017
Size1.4 Mb.
1   ...   4   5   6   7   8   9   10   11   ...   28

Chapter 5. Motifs and Loops

This chapter continues demonstrating the basics of the Perl language begun in Chapter 4. By the end of the chapter, you will know how to:



  • Search for motifs in DNA or protein

  • Interact with users at the keyboard

  • Write data to files

  • Use loops

  • Use basic regular expressions

  • Take different actions depending on the outcome of conditional tests

  • Examine sequence data in detail by operating on strings and arrays

These topics, in addition to what you learned in Chapter 4, will give you the skills necessary to begin to write useful bioinformatics programs; in this chapter, you will learn to write a program that looks for motifs in sequence data.

5.1 Flow Control

Flow control is the order in which the statements of a program are executed. A program executes from the first statement at the top of the program to the last statement at the bottom, in order, unless told to do otherwise. There are two ways to tell a program to do otherwise: conditional statements and loops. A conditional statement executes a group of statements only if the conditional test succeeds; otherwise, it just skips the group of statements. A loop repeats a group of statements until an associated test fails.

5.1.1 Conditional Statements

Let's take another look at the open statement. Recall that if you try to open a nonexistent file, you get error messages. You can test for the existence of a file explicitly, before trying to open it. In fact, such tests are among the most powerful features of computer languages. The if , if-else, and unless conditional statements are three such testing mechanisms in Perl.

The main feature of these kinds of constructs is the testing for a conditional. A conditional evaluates to a true or false value. If the conditional is true, the statements following are executed; if the conditional is false, they are skipped (or vice versa).

However, "What is truth?" It's a question that programming languages may answer in slightly different ways.

This section contains a few examples that demonstrate some of Perl's conditionals. The true-false condition in each example is equality between two numbers. Notice that equality of numbers is represented by two equal signs ==, because the single equal sign = is already used for assignment to a variable.


Confusion between = for assignment and == for numeric equality is a frequent programming bug, so watch for it!


The following examples demonstrate whether the conditional will evaluate to true or false. You don't ordinarily have much use for such simple tests. Usually you test the values that have been read into variables or the return value of function calls—things you don't necessarily know beforehand.

The if statement with a true conditional:

if( 1 == 1) {
print "1 equals 1\n\n";

}

produces the output:



1 equals 1

The test is 1 == 1, or, in English, "Does 1 equal 1?" Since it does, the conditional evaluates to true, the statement associated with the if statement is executed, and a message is printed out.

You can also just say:

if( 1) {


print "1 evaluates to true\n\n";

}

which produces the output:



1 evaluates to true

The if statement with a false conditional:

if( 1 == 0) {

print "1 equals 0\n\n";

}

produces no output! The test is 1 == 0 or, in English, "Does 1 equal 0?" Since it doesn't, the conditional evaluates to false, the statements associated with the if statement aren't executed, and no message is printed out.


You can also just say:

if( 0 ) {

print "0 evaluates to true\n\n";

}

which produces no output, since 0 evaluates to false, so the statements associated with the if statement are skipped entirely.


There's another way to write short if statements that mirrors how the English language works. In English, you can say, equivalently, "If you build it, they will come" or "They will come if you build it." Not to be outdone, Perl also allows you to put the if after the action:

print "1 equals 1\n\n" if (1 == 1);

which does the same thing as the first example in this section and prints out:

1 equals 1

Now, let's look at an if-else statement with a true conditional:

if( 1 == 1) {

print "1 equals 1\n\n";

} else {


print "1 does not equal 1\n\n";

}

which produces the output:



1 equals 1

The if-else does one thing if the test evaluates to true and another if the test evaluates to false. Here is if-else with a false conditional:

if( 1 == 0) {

print "1 equals 0\n\n";

} else {

print "1 does not equal 0\n\n";

}

which produces the output:



1 does not equal 0

The final example is unless—the opposite of if. It works like the English word "unless": e.g., "Unless you study Russian literature, you are ignorant of Chekov." If the conditional evaluates to true, no action is taken; if it evaluates to false, the associated statements are executed. If "you study Russian literature" is false, "you are ignorant of Chekov."

unless( 1 == 0) {

print "1 does not equal 0\n\n";

}

produces the output:



1 does not equal 0

5.1.1.1 Conditional tests and matching braces

Two more comments are in order about these statements and their conditional tests.

First, there are several tests that can be used in the conditional part of these statements. In addition to numeric equality == as in the previous example, you can also test for inequality !=, greater than >, less than <, and more.

Similarly, you can test for string equality using the eq operator: if two strings are the same, it's true. There are also file test operators that allow you to test if a file exists, is empty, if permissions are set a certain way, and so on (see Appendix B). One common test is just a variable name: if the variable contains zero, it's considered false; any other number evaluates to true. If the variable contains a nonempty string, it evaluates to true; the empty string, designated by "" or '', is false.

Second, notice that the statements that follow the conditional are enclosed within a matching pair of curly braces. These statements within curly braces are called a block and arise frequently in Perl.[1] Matching pairs of parentheses, brackets, or braces, i.e., ( ), [ ], < >, and { }, are common programming features. Having the same number of left and right braces in the right places is essential for a Perl program to run correctly.


[1] As something of an oddity, the last statement within a block doesn't need a semicolon after it.

Matching braces are easy to lose track of, so don't be surprised if you miss some and get error messages when you try to run the program. This is a common syntax error; you have to go back and find the missing brace. As code gets more complex, it can be a challenge to figure out where the matching braces are wrong and how to fix them. Even if the braces are in the right place, it can be hard to figure out what statements are grouped together when you're reading code. You can avoid this problem by writing code that doesn't try to do too much on any one line and uses indentation to further highlight the blocks of code (see Section 5.2).[2]



[2] Some text editors help you find a matching brace (for instance, in the vi editor, hitting the percent key % over a parenthesis bounces you to the matching parenthesis).

Back to the conditional statements. The if-else also has an if-elsif-else form, as in Example 5-1. The conditionals, first the if and then the elsifs, are evaluated in turn, and as soon as one evaluates to true, its block is executed, and the rest of the conditionals are ignored. If none of the conditionals evaluates to true, the else block is executed if there is one—it's optional.



Example 5-1. if-elsif-else

#!/usr/bin/perl -w

# if-elsif-else
$word = 'MNIDDKL';
# if-elsif-else conditionals

if($word eq 'QSTVSGE') {


print "QSTVSGE\n";
} elsif($word eq 'MRQQDMISHDEL') {
print "MRQQDMISHDEL\n";
} elsif ( $word eq 'MNIDDKL' ) {

print "MNIDDKL--the magic word!\n";

} else {
print "Is \"$word\" a peptide? This program is not sure.\n";
}
exit;

Notice the \" in the else block's print statement; it lets you print a double-quote sign (") within a double-quoted string. The backslash character tells Perl to treat the following " as the sign itself and not interpret it as the marker for the end of the string. Also note the use of eq to check for equality between strings.



Example 5-1 gives the output:

MNIDDKL--the magic word!



5.1.2 Loops

A loop allows you to repeatedly execute a block of statements enclosed within matching curly braces. There are several ways to loop in Perl: while loops, for loops, foreach loops, and more. Example 5-2 (from Chapter 4) displays the while loop and how it's used while reading protein sequence data in from a file.



Example 5-2. Reading protein sequence data from a file, take 4

#!/usr/bin/perl -w

# Reading protein sequence data from a file, take 4
# The filename of the file containing the protein sequence data

$proteinfilename = 'NM_021964fragment.pep';


# First we have to "open" the file, and in case the

# open fails, print an error message and exit the program.

unless ( open(PROTEINFILE, $proteinfilename) ) {
print "Could not open file $proteinfilename!\n";

exit;


}

# Read the protein sequence data from the file in a "while" loop,

# printing each line as it is read.

while( $protein =

) {
print " ###### Here is the next line of the file:\n";
print $protein;

}
# Close the file.

close PROTEINFILE;
exit;

Here's the output of Example 5-2:

###### Here is the next line of the file:

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD

###### Here is the next line of the file:

SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ

###### Here is the next line of the file:

GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

In the while loop, notice how the variable $protein is assigned each time through the loop to the next line of the file. In Perl, an assignment returns the value of the assignment. Here, the test is whether the assignment succeeds in reading another line. If there is another line to read in, the assignment occurs, the conditional is true, the new line is stored in the variable $protein, and the block with the two print statements is executed. If there are no more lines, the assignment is undefined, the conditional is false, and the program skips the block with the two print statements, quits the while loop, and continues to the following parts of the program (in this case, the close and exit functions).

5.1.2.1 open and unless

The open call is a system call, because to open a file, Perl must ask for the file from the operating system. The operating system may be a version of Unix or Linux, a Microsoft Windows versions, one of the Apple Macintosh operating systems, and so on. Files are managed by the operating system and can be accessed only by it.

It's a good habit to check for the success or failure of system calls, especially when opening files. If a system call fails, and you're not checking for it, your program will continue, perhaps attempting to read or write to a file you couldn't open in the first place. You should always check for failure and let the user of the program know right away when a file can't be opened. Often you may want to exit the program on failure or try to open a different file.

In Example 5-2, the open system call is part of the test of the unless conditional.


unless
is the opposite of if. Just as in English you can say "do the statements in the block if the condition is true"; you can also say the opposite, "do the statements in the block unless the condition is true." The open system call gives you a true value if it successfully opens the file; so here, in the conditional test of the unless statement, if the open call fails, the statements in the block are performed, the program prints an error message, and then exits.

To sum up, conditionals and loops are simple ideas and not difficult to learn in Perl. They are among the most powerful features of programming languages. Conditionals allow you to tailor a program to several alternatives, and in that way, make decisions based on the type of input it gets. They are responsible for a large part of whatever artificial intelligence there is in a computer program. Loops harness the speed of the computer so that in a few lines of code, you can handle large amounts of input or continually iterate and refine a computation.



5.2 Code Layout

Once you start using loops and conditional statements, you need to think seriously about formatting. You have many options when formatting Perl code on the page. Compare these variant ways of formatting an if statement inside a while loop:



Format A

while ( $alive ) {

if ( $needs_nutrients ) {

print "Cell needs nutrients\n";

}

}

Format B



while ( $alive )

{

if ( $needs_nutrients )



{

print "Cell needs nutrients\n";

}

}

Format C



while ( $alive )

{

if ( $needs_nutrients )



{

print "Cell needs nutrients\n";

}

}

Format D


while($alive){if($needs_nutrients){print "Cell needs nutrients\n";}}

These code fragments are equivalent as far as the Perl interpreter is concerned. That's because Perl doesn't rely on how the statements are laid out on the lines; Perl cares only about the correct order of the syntactical elements. Some elements need some whitespace (such as spaces, tabs, or newlines) between them to make them distinct, but in general, Perl doesn't restrict how you use whitespace to lay out your code.

Formats A and B are common ways to lay out code. They both make the program structure clear to the human reading it. Notice how the statements that have a block associated with them—the while and if statements—line up the curly braces and indent the statements within the blocks. These layouts make clear the extent of the block associated with the statements. (This can be critical for long, complicated blocks.) The statements inside the blocks are indented, for which you normally use the Tab key or groups of four or eight spaces. (Many text editors allow you to insert spaces when you hit the Tab key, or you can instruct them to set the tab stops at four, eight, or whatever number of spaces.) The overall structure of the program becomes clearer this way; you can easily see which statements are grouped in a block and associated with a given loop or conditional. Personally, I prefer the layout in Format A, although I'm also perfectly happy with Format B.

Format C is an example of badly formatted code. The flow control of the code isn't clear; for instance, it's hard to see if the print statement is in the block of the while statement.

Format D demonstrates how hard it is to read code with essentially no formatting, even a simple fragment like this.

The Perl style guide, available from the main Perl manual page or from the command line by typing:

perldoc perlstyle

has some recommendations and some suggestions for ways to write readable code. However, they are not rules, and you may use your own judgment as to the formatting practices that work best for you.



5.3 Finding Motifs

One of the most common things we do in bioinformatics is to look for motifs, short segments of DNA or protein that are of particular interest. They may be regulatory elements of DNA or short stretches of protein that are known to be conserved across many species. (The PROSITE web site at http://www.expasy.ch/prosite/ has extensive information about protein motifs.)

The motifs you look for in biological sequences are usually not one specific sequence. They may have several variants—for example, positions in which it doesn't matter which base or residue is present. They may have variant lengths as well. They can often be represented as regular expressions, which you'll see more of in the discussion following Example 5-3, in Chapter 9, and elsewhere in the book.

Perl has a handy set of features for finding things in strings. This, as much as anything, has made it a popular language for bioinformatics. Example 5-3 introduces this string-searching capability; it does something genuinely useful, and similar programs are used all the time in biology research. It does the following:



  • Reads in protein sequence data from a file

  • Puts all the sequence data into one string for easy searching

  • Looks for motifs the user types in at the keyboard

Example 5-3. Searching for motifs

#!/usr/bin/perl -w

# Searching for motifs

# Ask the user for the filename of the file containing

# the protein sequence data, and collect it from the keyboard

print "Please type the filename of the protein sequence data: ";
$proteinfilename = ;
# Remove the newline from the protein filename

chomp $proteinfilename;


# open the file, or exit

unless ( open(PROTEINFILE, $proteinfilename) ) {


print "Cannot open file \"$proteinfilename\"\n\n";

exit;


}
# Read the protein sequence data from the file, and store it

# into the array variable @protein

@protein =
;
# Close the file - we've read all the data into @protein now.

close PROTEINFILE;


# Put the protein sequence data into a single string, as it's easier

# to search for a motif in a string than in an array of

# lines (what if the motif occurs over a line break?)

$protein = join( '', @protein);


# Remove whitespace

$protein =~ s/\s//g;


# In a loop, ask the user for a motif, search for the motif,

# and report if it was found.

# Exit if no motif is entered.

do {


print "Enter a motif to search for: ";
$motif = ;
# Remove the newline at the end of $motif
chomp $motif;
# Look for the motif
if ( $protein =~ /$motif/ ) {
print "I found it!\n\n";
} else {
print "I couldn\'t find it.\n\n";

}
# exit on an empty user input

} until ( $motif =~ /^\s*$/ );
# exit the program

exit;


Here's some typical output from Example 5-3:

Please type the filename of the protein sequence data: NM_021964fragment.pep

Enter a motif to search for: SVLQ

I found it!


Enter a motif to search for: jkl

I couldn't find it.

Enter a motif to search for: QDSV

I found it!

Enter a motif to search for: HERLPQGLQ

I found it!


Enter a motif to search for:

I couldn't find it.

As you see from the output, this program finds motifs that the user types in at the keyboard. With such a program, you no longer have to search manually through potentially huge amounts of data. The computer does the work and does it much faster and more accurately than a human.

It'd be nice if this program not only reported it found the motif but at what position. You'll see how this can be accomplished in Chapter 9. An exercise in that chapter challenges you to modify this program so that it reports the positions of the motifs.

The following sections examine and discuss the parts of Example 5-3 that are new:


  • Getting user input from the keyboard

  • Joining lines of a file into a single scalar variable

  • Regular expressions and character classes

  • do-until loops

  • Pattern matching

5.3.1 Getting User Input from the Keyboard

You first saw filehandles in Example 4-5. In Example 5-3 (as was true in Example 4-3), a filehandle and the angle bracket input operator are used to read in data from an opened file into an array, like so:

@protein =

;

Perl uses the same syntax to get input that is typed by the user at the keyboard. In Example 5-3, a special filehandle called STDIN (short for standard input), is used for this purpose, as in this line that collects a filename from the user:

$proteinfilename = ;

So, a filehandle can be associated with a file; it can also be associated with the keyboard where the user types responses to questions the program asks.

If the variable you're using to save the input is a scalar variable starts with a dollar sign $), as in this fragment, only one line is read, which is almost always what you want in this case.

In Example 5-3, the user is requested to enter the filename of a file containing protein sequence data. After getting a filename in this fashion, there's one more step before you can open the file. When the user types in a filename and sends a newline by hitting the Enter key (also known as the Return key), the filename also gets a newline character at the end as it is stored in the variable. This newline is not part of the filename and has to be removed before the open system call will work. The Perl function chomp removes newlines (or its cousins linefeeds and carriage returns) from the end of a string. (The older function chop removes the last character, no matter what it is; this caused trouble, so chomp was introduced and is almost always preferred.)

So this part of Perl requires a little bit extra: removing the newline from the input collected from the user at the keyboard. Try commenting out the chomp function, and you'll see that the open fails, because no filename has a newline at the end. (Operating systems have rules as to which characters are allowed in filenames.)
5.3.2 Turning Arrays into Scalars with join

It's common to find protein sequence data broken into short segments of 80 or so characters each. The reason is simple: when data is printed out on paper or displayed on the screen, it needs to be broken up into lines that fit into the space. Having your data broken into segments, however, is inconvenient for your Perl program. What if you're searching for a motif that's split by a newline character? Your program won't find it. In fact, some of the motifs searched for in Example 5-3 are split by line breaks. In Perl you deal with this sort of segmented data with the Perl function join. In Example 5-3 join collapses an array @protein by combining all the lines of data into a single string stored in a new scalar variable $protein:

$protein = join( '', @protein);

You specify a string to be placed between the elements of the array as they're joined. In this case, you specify the empty string to be placed between the lines of the input file. The empty string is represented with the pair of single quotes '' (double quotes "" also serve).

Recall that in Example 4-2, I introduced several equivalent ways to concatenate two fragments of DNA. The use of the join function is very similar. It takes the scalar values that are the elements of the array and concatenates them into a single scalar value. Recall the following statement from Example 4-2, which is one of the equivalent ways to concatenate two strings:

$DNA3 = $DNA1 . $DNA2;

Another way to accomplish the same concatenation uses the join function:

$DNA3 = join( "", ($DNA1, $DNA2) );

In this version, instead of giving an array name, I specify a list of scalar elements:

($DNA1, $DNA2)




1   ...   4   5   6   7   8   9   10   11   ...   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page