Beginning Perl for Bioinformatics

Download 1.4 Mb.
Date conversion29.03.2017
Size1.4 Mb.
1   2   3   4   5   6   7   8   9   ...   28

4.2.1 Control Flow

Example 4-1 illustrates many of the ideas all our Perl programs will rely on. One of these ideas is control flow , or the order in which the statements in the program are executed by the computer.

Every program starts at the first line and executes the statements one after the other until it reaches the end, unless it is explicitly told to do otherwise. Example 4-1 simply proceeds from top to bottom, with no detours.

In later chapters, you'll learn how programs can control the flow of execution.

4.2.2 Comments Revisited

Now let's take a look at the parts of Example 4-1. You'll notice lots of blank lines. They're there to make the program easy for a human to read. Next, notice the comments that begin with the # sign. Remember from Chapter 3 that when Perl runs, it throws these away along with the blank lines. In fact, to Perl, the following is exactly the same program as Example 4-1:

#!/usr/bin/perl -w


In Example 4-1, I've made liberal use of comments. Comments at the beginning of code can make it clear what the program is for, who wrote it, and present other information that can be helpful when someone needs to understand the code. Comments also explain what each section of the code is for and sometimes give explanations on how the code achieves its goals.

It's tempting to belabor the point about the importance of comments. Suffice it to say that in most university-level, computer-science class assignments, the program without comments typically gets a low or failing grade; also, the programmer on the job who doesn't comment code is liable to have a short and unsuccessful career.

4.2.3 Command Interpretation

Because it starts with a # sign, the first line of the program looks like a comment, but it doesn't seem like a very informative comment:

#!/usr/bin/perl -w

This is a special line called command interpretation that tells the computer running Unix and Linux that this is a Perl program. It may look slightly different on different computers. On some machines, it's also unnecessary because the computer recognizes Perl from other information. A Windows machine is usually configured to assume that any program ending in .pl is a Perl program. In Unix or Linux, a Windows command window, or a MacOS X shell, you can type perl my_program, and your Perl program my_program won't need the special line. However, it's commonly used, so we'll have it at start all our programs.

Notice that the first line of code uses a flag -w. The "w" stands for warnings, and it causes Perl to print messages in case of an error. Very often the error message suggests the line number where it thinks the error began. Sometimes the line number is wrong, but the error is usually on or just before the line the message suggests. Later in the book, you'll also see the statement use warnings as an alternative to -w.

4.2.4 Statements

The next line of Example 4-1 stores the DNA in a variable:


This is a very common, very important thing to do in a computer language, so let's take a leisurely look at it. You'll see some basic features about Perl and about programming languages in general, so this is a good place to stop skimming and actually read.

This line of code is called a statement. In Perl, statements end in a semicolon (;). The use of the semicolon is similar to the use of the period in the English language.

To be more accurate, this line of code is an assignment statement. Its purpose in this program is to store some DNA into a variable called $DNA. There are several fundamental things happening here as you will see in the next sections. Variables

First, let's look at the variable $DNA. Its name is somewhat arbitrary. You can pick another name for it, and the program behaves the same way. For instance, if you replace the two lines:

print $DNA;

with these:


print $A_poem_by_Seamus_Heaney;

the program behaves in exactly the same way, printing out the DNA to the computer screen. The point is that the names of variables in a computer program are your choice. (Within certain restrictions: in Perl, a variable name must be composed from upper- or lowercase letters, digits, and the underscore _ character. Also the first character must not be a digit.)

This is another important point along the same lines as the remarks I've already made about using blank lines and comments to make your code more easily read by humans. The computer attaches no meaning to the use of the variable name $DNA instead of $A_poem_by_Seamus_Heaney, but whoever reads the program certainly will. One name makes perfect sense, clearly indicates what the variable is for in the program, and eases the chore of understanding the program. The other name makes it unclear what the program is doing or what the variable is for. Using well-chosen variable names is part of what's called self-documenting code. You'll still need comments, but perhaps not as many, if you pick your variable names well.

You've noticed that the variable name $DNA starts with dollar sign. In Perl this kind of variable is called a scalar variable, which is a variable that holds a single item of data. Scalar variables are used for such data as strings or various kinds of numbers (e.g., the string hello or numbers such as 25, 6.234, 3.5E10, -0.8373). A scalar variable holds just one item of data at a time. Strings

In Example 4-1, the scalar variable $DNA is holding some DNA, represented in the usual way by the letters A, C, G, and T. As stated earlier, in computer science a sequence of letters is called a string. In Perl you designate a string by putting it in quotes. You can use single quotes, as in Example 4-1, or double quotes. (You'll learn the difference later.) The DNA is thus represented by:


In Perl, to set a variable to a certain value, you use the = sign. The = sign is called the assignment operator . In Example 4-1, the value:


is assigned to the variable $DNA. After the assignment, you can use the name of the variable to get the value, as in the print statement in Example 4-1.

The order of the parts is important in an assignment statement. The value assigned to something appears to the right of the assignment operator. The variable that is assigned a value is always to the left of the assignment operator. In programming manuals, you sometimes come across the terms lvalue and rvalue to refer to the left and right sides of the assignment operator.

This use of the = sign has a long history in programming languages. However, it can be a source of confusion: for instance, in most mathematics, using = means that the two things on either side of the sign are equal. So it's important to note that in Perl, the = sign doesn't mean equality. It assigns a value to a variable. (Later, we'll see how to represent equality.)

So, to summarize what we've learned so far about this statement:


It's an assignment statement that sets the value of the scalar variable $DNA to a string representing some DNA. Print

The statement:

print $DNA;

prints ACGGGAGGACGGGAAAATTACTACGGCATTAGC out to the computer screen. Notice that the print statement deals with scalar variables by printing out their values—in this case, the string that the variable $DNA contains. You'll see more about printing later. Exit

Finally, the statement exit; tells the computer to exit the program. Perl doesn't require an exit statement at the end of a program; once you get to the end, the program exits automatically. But it doesn't hurt to put one in, and it clearly indicates the program is over. You'll see other programs that exit if something goes wrong before the program normally finishes, so the exit statement is definitely useful.

4.3 Concatenating DNA Fragments

Now we'll make a simple modification of Example 4-1 to show how to concatenate two DNA fragments. Concatenation is attaching something to the end of something else. A biologist is well aware that joining DNA sequences is a common task in the biology lab, for instance when a clone is inserted into a cell vector or when splicing exons together during the expression of a gene. Many bioinformatics software packages have to deal with such operations; hence its choice as an example.

Example 4-2 demonstrates a few more things to do with strings, variables, and print statements.

Example 4-2. Concatenating DNA

#!/usr/bin/perl -w

# Concatenating DNA

# Store two DNA fragments into two variables called $DNA1 and $DNA2



# Print the DNA onto the screen

print "Here are the original two DNA fragments:\n\n";

print $DNA1, "\n";
print $DNA2, "\n\n";
# Concatenate the DNA fragments into a third variable and print them

# Using "string interpolation"

$DNA3 = "$DNA1$DNA2";
print "Here is the concatenation of the first two fragments (version 1):\n\n";
print "$DNA3\n\n";
# An alternative way using the "dot operator":

# Concatenate the DNA fragments into a third variable and print them

$DNA3 = $DNA1 . $DNA2;
print "Here is the concatenation of the first two fragments (version 2):\n\n";
print "$DNA3\n\n";
# Print the same thing without using the variable $DNA3

print "Here is the concatenation of the first two fragments (version 3):\n\n";

print $DNA1, $DNA2, "\n";

As you can see, there are three variables here, $DNA1, $DNA2, and $DNA3. I've added print statements for a running commentary, so that the output of the program that appears on the computer screen makes more sense and isn't simply some DNA fragments one after the other.

Here's what the output of Example 4-2 looks like:

Here are the original two DNA fragments:



Here is the concatenation of the first two fragments (version 1):

Here is the concatenation of the first two fragments (version 2):

Here is the concatenation of the first two fragments (version 3):

Example 4-2 has many similarities to Example 4-1. Let's look at the differences. To start with, the print statements have some extra, unintuitive parts:

print $DNA1, "\n";

print $DNA2, "\n\n";

The print statements have variables containing the DNA, as before, but now they also have a comma and then "\n" or "\n\n". These are instructions to print newlines. A newline is invisible on the page or screen, but it tells the computer to go on to the beginning of the next line for subsequent printing. One newline, "\n", simply positions you at the beginning of the next line. Two new lines, "\n\n", moves to the next line and then positions you at the beginning of the line after that, leaving a blank line in between.

Look at the code for Example 4-2 and to make sure you see what these newline directives do to the output. A blank line is a line with nothing printed on it. Depending on your operating system, it may be just a newline character or a combination formfeed and carriage return (in which cases, it may also be called an empty line), or it may include nonprinting whitespace characters such as spaces and tabs. Notice that the newlines are enclosed in double quotes, which means they are parts of strings. (Here's one difference between single and double quotes, as mentioned earlier: "\n" prints a newline; '\n' prints \n as written.)

Notice the comma in the print statement. A comma separates items in a list. The print statement prints all the items that are listed. Simple as that.

Now let's look at the statement that concatenates the two DNA fragments $DNA1 and $DNA2 into the variable $DNA3:

$DNA3 = "$DNA1$DNA2";

The assignment to $DNA3 is just a typical assignment as you saw in Example 4-1, a variable name followed by the = sign, followed by a value to be assigned.

The value to the right of the assignment statement is a string enclosed in double quotes. The double quotes allow the variables in the string to be replaced with their values. This is called string interpolation .[2] So, in effect, the string here is just the DNA of variable $DNA1, followed directly by the DNA of variable $DNA2. That concatenation of the two DNA fragments is then assigned to variable $DNA3.

[2] There are occasions when you might add curly braces during string interpolation. The extra curly braces make sure the variable names aren't confused with anything else in the double-quoted string. For example, if you had variable $prefix and tried to interpolate it into the string I am $prefixinterested, Perl might not recognize the variable, confusing it with a nonexistent variable $prefixinterested. But the string I am ${prefix}interested is unambiguous to Perl.

After assigning the concatenated DNA to variable $DNA3, you print it out, followed by a blank line:

print "$DNA3\n\n";

One of the Perl catch phrases is, "There's more than one way to do it." So, the next part of the program shows another way to concatenate two strings, using the dot operator. The dot operator, when placed between two strings, creates a single string that concatenates the two original strings. So the line:

$DNA3 = $DNA1 . $DNA2;

illustrates the use of this operator.

An operator in a computer language takes some arguments—in this case, the strings $DNA1 and $DNA2—and does something to them, returning a value—in this case, the concatenated string placed in the variable $DNA3. The most familiar operators from arithmetic—plus, minus, multiply, and divide—are all operators that take two numbers as arguments and return a number as a value.

Finally, just to exercise the different parts of the language, let's accomplish the same concatenation using only the print statement:

print $DNA1, $DNA2, "\n";

Here the print statement has three parts, separated by commas: the two DNA fragments in the two variables and a newline. You can achieve the same result with the following print statement:

print "$DNA1$DNA2\n";

Maybe the Perl slogan should be, "There are more than two ways to do it."

Before leaving this section, let's look ahead to other uses of Perl variables. You've seen the use of variables to hold strings of DNA sequence data. There are other types of data, and programming languages need variables for them, too. In Perl, a scalar variable such as $DNA can hold a string, an integer, a floating-point number (with a decimal point), a boolean (true or false) value, and more. When it's required, Perl figures out what kind of data is in the variable. For now, try adding the following lines to Example 4-1 or Example 4-2, storing a number in a scalar variable and printing it out:

$number = 17;

print $number,"\n";

4.4 Transcription: DNA to RNA

A large part of what you, the Perl bioinformatics programmer, will spend your time doing amounts to variations on the same theme as Examples 4-1 and 4-2. You'll get some data, be it DNA, proteins, GenBank entries, or what have you; you'll manipulate the data; and you'll print out some results.

Example 4-3 is another program that manipulates DNA; it transcribes DNA to RNA. In the cell, this transcription of DNA to RNA is the outcome of the workings of a delicate, complex, and error-correcting molecular machinery.[3] Here it's a simple substitution. When DNA is transcribed to RNA, all the T's are changed to U's, and that's all that our program needs to know.[4]

[3] Briefly, the coding DNA strand is the reverse complement of the other strand, which is used as a template to synthesize its reverse complement as RNA, with T's replaced as U's. With the two reverse complements, this is the same as the coding strand with the TU

[4] We're ignoring the mechanism of the splicing out of introns, obviously. The T stands for thymine; the U stands for uracil.

Example 4-3. Transcribing DNA into RNA

#!/usr/bin/perl -w

# Transcribing DNA into RNA
# The DNA


# Print the DNA onto the screen

print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting all T's with U's.

$RNA = $DNA;

$RNA =~ s/T/U/g;
# Print the RNA onto the screen

print "Here is the result of transcribing the DNA to RNA:\n\n";

print "$RNA\n";
# Exit the program.


Here's the output of Example 4-3:

Here is the starting DNA:

Here is the result of transcribing the DNA to RNA:

This short program introduces an important part of Perl: the ability to easily manipulate text data such as a string of DNA. The manipulations can be of many different sorts: translation, reversal, substitution, deletions, reordering, and so on. This facility of Perl is one of the main reasons for its success in bioinformatics and among programmers in general.

First, the program makes a copy of the DNA, placing it in a variable called $RNA:

$RNA = $DNA;

Note that after this statement is executed, there's a variable called $RNA that actually contains DNA.[5] Remember this is perfectly legal—you can call variables anything you like—but it is potentially confusing to have inaccurate variable names. Now in this case, the copy is preceded with informative comments and followed immediately with a statement that indeed causes the variable $RNA to contain RNA, so it's all right. Here's a way to prevent $RNA from containing anything except RNA:

[5] Recall the discussion in Section about the importance of the order of the parts in an assignment statement. Here, the value of $DNA, that is, the DNA sequence data that has been stored in the $DNA variable, is being assigned to the variable $RNA. If you had written $DNA = $RNA;, the value of the $RNA variable (which is empty) would have been assigned to the $DNA variable, in effect wiping out the DNA sequence data in that variable and leaving two empty variables.

($RNA = $DNA) =~ s/T/U/g;

In Example 4-3, the transcription happens in this statement:

$RNA =~ s/T/U/g;

There are two new items in this statement: the binding operator (=~) and the substitute command s/T/U/g.

The binding operator =~ is used, obviously enough, on variables containing strings; here the variable $RNA contains DNA sequence data. The binding operator means "apply the operation on the right to the string in the variable on the left."

The substitution operator , shown in Figure 4-1, requires a little more explanation. The different parts of the command are separated (or delimited) by the forward slash. First, the s indicates this is a substitution. After the first / comes a T, which represents the element in the string that will be substituted. After the second / comes a U, which represents the element that's going to replace the T. Finally, after the third / comes g. This g stands for "global" and is one of several possible modifiers that can appear in this part of the statement. Global means "make this substitution throughout the entire string," that is to say, everywhere possible in the string.

Figure 4-1. The substitution operator

Thus, the meaning of the statement is: "substitute all T's for U's in the string data stored in the variable $RNA."

The substitution operator is an example of the use of regular expressions. Regular expressions are the key to text manipulation, one of the most powerful features of Perl as you'll see in later chapters.

4.5 Using the Perl Documentation

A Perl programmer's most important resource is the Perl documentation. It should be installed on your computer, and it may also be found on the Internet at the Perl site. The Perl documentation may come in slightly different forms on your computer system, but the web version is the same for everybody. That's the version I refer to in this book. See the references in Appendix A for more discussion about different sources of Perl documentation.

Just to try it out, let's look up the print operator. First, open your web browser, and go to Then click on the Documentation link. Select "Perl's Builtin Functions" and then "Alphabetical Listing of Perl's Functions". You'll see a rather lengthy alphabetical listing of Perl's functions. Once you've found this page, you may want to bookmark it in your browser, as you may find yourself turning to it frequently. Now click on Print to view the print operator.

Check out the examples they give to see how the language feature is actually used. This is usually the quickest way to extract what you need to know.

Once you've got the documentation on your screen, you may find that reading it answers some questions but raises others. The documentation tends to give the entire story in a concise form, and this can be daunting for beginners. For instance, the documentation for the print function starts out: "Prints a string or a comma-separated list of strings. Returns TRUE if successful." But then comes a bunch of gibberish (or so it seems at this point in your learning curve!) Filehandles? Output streams? List context?

All this information is necessary in documentation; after all, you need to get the whole story somewhere! Usually you can ignore what doesn't make sense.

The Perl documentation also includes several tutorials that can be a great help in learning Perl. They occasionally assume more than a beginner's knowledge about programming languages, but you may find them very useful. Exploring the documentation is a great way to get up to speed on the Perl language.

4.6 Calculating the Reverse Complement in Perl

As you recall from Chapter 1, a DNA polymer is composed of nucleotides. Given the close relationship between the two strands of DNA in a double helix, it turns out that it's pretty straightforward to write a program that, given one strand, prints out the other. Such a calculation is an important part of many bioinformatics applications. For instance, when searching a database with some query DNA, it is common to automatically search for the reverse complement of the query as well, since you may have in hand the opposite strand of some known gene.

Without further ado, here's Example 4-4, which uses a few new Perl features. As you'll see, it first tries one method, which fails, and then tries another method, which succeeds.

1   2   3   4   5   6   7   8   9   ...   28

The database is protected by copyright © 2017
send message

    Main page