Beginning Perl for Bioinformatics

Download 1.4 Mb.
Date conversion29.03.2017
Size1.4 Mb.
1   ...   6   7   8   9   10   11   12   13   ...   28

5.7 Writing to Files

Example 5-7 shows one more way to count nucleotides in a string of DNA. It uses a Perl trick that was designed with exactly this kind of job in mind. It puts a global regular expression search in the test for a while loop, and as you'll see, it's a compact way of counting characters in a string.

One of the nice things about Perl is that if you need to do something fairly regularly, the language has probably got a relatively succinct way to do it. (The downside of this is that Perl has a lot of things about it to learn.)

The results of Example 5-7, besides being printed to the screen, will also be written to a file. The code that accomplishes this writing to a file is as follows:

# Also write the results to a file called "countbase"

$outputfile = "countbase";


unless ( open(COUNTBASE, ">$outputfile") ) {

print "Cannot open file \"$outputfile\" to write to!!\n\n";


print COUNTBASE "A=$a C=$c G=$g T=$t errors=$e\n";

As you see, to write to a file, you do an open call, just as when reading from a file, but with a difference: you prepend a greater-than sign > to the filename. The filehandle becomes a first argument to a print statement (but without a comma following it). This makes the print statement direct its output into the file.[6]

[6] In this case, if the file already exists, it's emptied out first. It's possible to specify several other behaviors. As mentioned earlier, the Perl documentation has all the details of the open function, which sets the options for reading from, and writing to, files as well as other actions.

Example 5-7 is the third version of the Perl program that examines each base in a string of DNA.

Example 5-7. Determining frequency of nucleotides, take 3

#!/usr/bin/perl -w

# Determining frequency of nucleotides, take 3
# Get the DNA sequence data

print "Please type the filename of the DNA sequence data: ";

$dna_filename = ;
chomp $dna_filename;
# Does the file exist?

unless ( -e $dna_filename) {

print "File \"$dna_filename\" doesn\'t seem to exist!!\n";


# Can we open the file?

unless ( open(DNAFILE, $dna_filename) ) {

print "Cannot open file \"$dna_filename\"\n\n";


@DNA = ;
close DNAFILE;
$DNA = join( '', @DNA);
# Remove whitespace

$DNA =~ s/\s//g;

# Initialize the counts.

# Notice that we can use scalar variables to hold numbers.

$a = 0; $c = 0; $g = 0; $t = 0; $e = 0;
# Use a regular expression "trick", and five while loops,

# to find the counts of the four bases plus errors

while($DNA =~ /a/ig){$a++}

while($DNA =~ /c/ig){$c++}

while($DNA =~ /g/ig){$g++}

while($DNA =~ /t/ig){$t++}

while($DNA =~ /[^acgt]/ig){$e++}

print "A=$a C=$c G=$g T=$t errors=$e\n";

# Also write the results to a file called "countbase"

$outputfile = "countbase";

unless ( open(COUNTBASE, ">$outputfile") ) {
print "Cannot open file \"$outputfile\" to write to!!\n\n";


print COUNTBASE "A=$a C=$c G=$g T=$t errors=$e\n";
# exit the program


Example 5-7 looks like this when you run it:

Please type the filename of the DNA sequence data: small.dna

A=40 C=27 G=24 T=17 errors=1

The output file countbase has the following contents after you run Example 5-7:

A=40 C=27 G=24 T=17 errors=1

The while loop:

while($dna =~ /a/ig){$a++}

has as its conditional test, within the parentheses, a string-matching expression:

$dna =~ /a/ig

This expression is looking for the regular expression /a/, that is, the letter a. Since it has the i modifier, it's a case-insensitive match, which means it matches a or A. It also has the global modifier, which means match all the a's in the string. (Without the global modifier, it just keeps returning true every time through the loop, if there is an "a" in $dna.)

Now, this string-matching expression, in the context of a while loop, causes the while loop to execute its block on every match of the regular expression. So, append the one-statement block:


to increment the counter at each match of the regular expression; in other words, you're counting all the a's.

One other point should be made about this third version of the program. You'll notice some of the statements have been changed and shortened this time around. Some variables have shorter names, some statements are lined up on one line, and the print statement at the end is more concise. These are just alternative ways of writing. As you program, you'll find yourself experimenting with different approaches: try some on for size.

The way to count bases in this third version is flexible; for instance, it allows you to count non-ACGT characters without specifying them individually. In later chapters, you'll use those while loops to good effect. However, there's an even faster way to count bases. You can use the tr transliteration function from Chapter 4; it's faster, which is helpful if you have a lot of DNA to count:

$a = ($dna =~ tr/Aa//);

$c = ($dna =~ tr/Cc//);

$g = ($dna =~ tr/Gg//);

$t = ($dna =~ tr/Tt//);

The tr function returns the count of the specified characters it finds in the string, and if the set of replacement characters is empty, it doesn't actually change the string. So it makes a good character counter. Notice that with tr, you have to spell out the upper- and lowercase letters. Also, because tr doesn't accept character classes, there's no direct way to count nonbases. You could, however, say:

$basecount = ($dna = ~ tr/ACGTacgt//);

$ nonbase = (length $dna) - $basecount)

The program however, runs faster using tr than using the while loops of Example 5-7.

You may find it a bit much to have three (really, four) versions of this base-counting program, especially since much of the code in each version is identical. The only part of the program that really changed was the part that did the counting of the bases. Wouldn't it have been convenient to have a way to just alter the part that counts the bases? In Chapter 6, you'll see how subroutines allow you to partition your programs in just such a way.

5.8 Exercises

Exercise 5.1

Use a loop to write a nonhalting program. The conditional must always evaluate to true, every time through the loop. Note that some systems will catch that you're in an infinite loop and will stop the program automatically. You will stop your program differently, depending on which operating system you use. Ctrl-C works on Unix and Linux, a Windows MS-DOS command window, or a MacOS X shell window.

Exercise 5.2

Prompt the user to enter two (short) strings of DNA. Concatenate the two strings of DNA by appending the second to the first using the .= assignment operator. Print the two strings as concatenated, and then print the second string lined up over its copy at the end of the concatenated strings. For example, if the input strings are AAAA and TTTT, print:



Exercise 5.3

Write a program that prints all the numbers from 1 to 100. Your program should have much fewer than 100 lines of code!

Exercise 5.4

Write a program to calculate the reverse complement of a strand of DNA. Do not use the s/// or the tr functions. Use the substr function, and examine each base one at a time in the original while you build up the reverse complement. (Hint: you might find it easier to examine the original right to left, rather than left to right, although either is possible.)

Exercise 5.5

Write a program to report on the percentage of hydrophobic amino acids in a protein sequence. (To find which amino acids are hydrophobic, consult any introductory text on proteins, molecular biology, or cell biology. You will find information sources in Appendix A.)

Exercise 5.6

Write a program that checks if two strings given as arguments are reverse complements of each other. Use the Perl built-in functions split, pop, shift, and eq (eq actually an operator).

Exercise 5.7

Write a program to report how GC-rich some sequence is. (In other words, just give the percentage of G and C in the DNA.)

Exercise 5.8

Modify Example 5-3 to not only find motifs by regular expressions but to print out the motif that was found. For example, if you search, using regular expressions, for the motif EE.*EE, your program should print EETVKNDEE. You can use the special variable $&. After a successful pattern match, this special variable is set to hold the pattern that was matched.

Exercise 5.9

Write a program that switches two bases in a DNA string at specified positions. (Hint: you can use the Perl functions substr or slice.

Exercise 5.10

Write a program that writes a temporary file and then deletes it. The unlink function removes a file: just say, for example:

unlink "tmpfile";

but also check to see if unlink is successful.

Chapter 6. Subroutines and Bugs

In this chapter you'll extend your basic knowledge in two directions:

  • Subroutines

  • Using the Perl debugger

Subroutines are an important way to structure programs. You'll use them in Chapter 7, where you'll learn how to use randomization to simulate the mutation of DNA. The Perl debugger examines a program's behavior in "slow motion" and helps you find those pesky bugs.

6.1 Subroutines

Subroutines are an important way to organize a program and are used in all major programming languages.

A subroutine wraps up a bit of code, gives the code a name, and provides a way to pass in some values for its calculations and then report back the results. The rest of the program can then use the subroutine's code just by calling its name, giving the needed values to pass in to the subroutine code and then collecting the results. This use or "invocation" of a subroutine is commonly referred to as calling the subroutine. You can think of a subroutine as a program within a program; just as you run programs to get results, so your programs call subroutines to get results. Once you have a subroutine, you can use it in a program simply by knowing which values to pass in and what kind of values to expect it to pass out.

6.1.1 Advantages of Subroutines

Subroutines provide several benefits. They endow programs with abstraction, modularization, and the ability to create large programs by organizing the code into manageable chunks with defined inputs and outputs.

Say you need to calculate something, for instance the mean of a distribution at several places in a program or in several different programs. By writing this calculation as a subroutine, you can write it once, and then call it whenever you need it, thus making your program:

  • Shorter, since you're reusing the code.

  • Easier to test, since you can test the subroutine separately.

  • Easier to understand, since it reduces clutter and better organizes programs.
  • More reliable, since you have less code when you reuse subroutines, so there are fewer opportunities for something to go wrong.

  • Faster to write, since you may, for example, have already written some subroutines that handle basic statistics and can just call the one that calculates the mean without having to write it again. Or better yet, you found a good statistics library someone else wrote, and you never had to write it at all.

There is another subtle, yet powerful idea at work here. Subroutines can themselves call other subroutines, that is, a subroutine can use another subroutine for help in its calculations.[1] By writing a set of subroutines, each of which does one or a few things well, you can combine them in various ways to make new subroutines. You can then combine the new subroutines, and so on, and the end result can be large and flexible programming systems. Decomposing problems into sets of subroutines that can be conveniently combined allows you to create environments that can grow and adapt to changing conditions with a minimum of effort.

[1] Subroutines can even call themselves, and this so-called recursion can be an elegant way to compute (see Chapter 11).

The trick of all this is in how you partition the code into subroutines. You want subroutines that encapsulate something that will be generally useful, and not just called once (although that sometimes can be useful too). There are various rules of thumb: a subroutine should do one thing well, and it should be no more than a page or two of code. These are not real rules, and exceptions are frequent, but they can help you divide your code into manageable chunks, suitable for subroutines.

6.1.2 Writing Subroutines

Let's look at how subroutines are used and then at how they're defined.

To use a subroutine, you pass data into the subroutine as arguments, and then you collect the return value(s) of the subroutine. For example, say you want a subroutine that, given some DNA, appends "ACGT" to the end of the DNA and returns the new, longer DNA. Let's call the subroutine addACGT. In Perl, you usually call a subroutine by typing its name, followed by a parenthesized list of arguments (if any). For example, here's a call to addACGT with the one argument $dna:


When calling a subroutine, older versions of Perl required starting the name of a subroutine with the & (ampersand) character. It's still okay to do so (e.g., : &addACGT), but these days the ampersand is usually omitted.[2]

[2] There are times, even in the newer versions of Perl, when an ampersand is required; you'll see one such case in Chapter 11, in Section 11.2.3, which describes the File::Find module. (See also the defined and undef functions in the documentation or the perlref manpage).

Example 6-1 demonstrates a subroutine that shows in detail how this works.

Example 6-1. A subroutine to append ACGT to DNA

#!/usr/bin/perl -w

# A program with a subroutine to append ACGT to DNA
# The original DNA


# The call to the subroutine "addACGT".

# The argument being passed in is $dna; the result is saved in $longer_dna

$longer_dna = addACGT($dna);
print "I added ACGT to $dna and got $longer_dna\n\n";

# Subroutines for Example 6-1

# Here is the definition for subroutine "addACGT"
sub addACGT {

my($dna) = @_;

$dna .= 'ACGT';

return $dna;


Example 6-1 produces the following output:


We'll now look at this code to see how subroutines are defined and used in a Perl program.

The first thing to notice, taking the large view, is that the program now has two sections. The first section starts from the beginning of the program and ends with the exit command. Following that (and announced by a blizzard of comments for easy reading) is a section for subroutine definitions, in this case, only the one definition for subroutine addACGT. It is common to place all subroutine definitions together at the end of a program, for ease in reading. Usually they're listed alphabetically or in some other convenient way.

Actually, it is legal to put the subroutine definitions almost anywhere in a program. This is because Perl first scans through the code and does things like check the syntax and learn subroutine definitions, before it starts to run the program. In particular, subroutine definitions can come after the point in the code where you use them (not necessarily before, which many people assume is the rule), and they don't have to be grouped together but can be scattered throughout the code. But our method of collecting them together at the end can make reading a program much easier. The possible exception is when a small subroutine is used in one section of code, as sometimes happens with the sort function, for instance. In this case having the definition right there can save the reader paging back and forth between the subroutine definition and its use. Usually, it's more convenient to read the program without the subroutine definitions, to get the overall flow of the program first, and then go back and look into the subroutines, if necessary.

As you see, Example 6-1 is very simple. It first stores some DNA into the variable $dna and then passes that variable as an argument to the subroutine call, which looks like this: addACGT($dna). The subroutine is called by its name, followed by parentheses containing the arguments to the subroutine. There may be no arguments, or if more than one, they are separated by commas. The value returned by the subroutine can be saved; in this program the value is saved in a variable called $longer_dna, which is then printed, and the program exits.

The part of the program from the beginning to the exit statement is called variously the main program or the main body of the program. By looking over this section of the code, you can see what happens from the beginning to the end of the program without looking into the details of the subroutines.

Now that you've looked over the main program of Example 6-1, it's time to look at the subroutine definition and how it uses the principal of scoping.

6.2 Scoping and Subroutines

A subroutine is defined by the reserved word [3] for subroutine definitions, sub; the subroutine's name, in this case, addACGT; and a block, enclosed in a pair of matching curly braces. This is the same kind of block seen earlier in loops and conditional statements that groups statements together.

[3] A reserved word is a fundamental, defined word in the Perl language, such as if, while, foreach, or sub.

In Example 6-1, the name of the subroutine is addACGT, and the block is everything after the name. Here is the subroutine definition again:

sub addACGT {

my($dna) = @_;

$dna .= 'ACGT';

return $dna;


Now let's look into the block of the subroutine.

A subroutine is like a separate helper program for the main program, and it needs to have its own variables. You will use two types of variables in your subroutines in this book:[4]

[4] In the subroutines in this book, we won't use global variables, which can be seen by both the main program and the subroutines; nor will we use variables declared with local, which provides a different kind of scoping restriction than my.

  • Arguments passed in to the subroutine

  • Other variables declared with my and restricted to the scope of the subroutine

Arguments are the values given to a subroutine when it is used, or called. The values of the arguments are passed into the subroutine by means of the special variable @_, as you'll see in the next section.

Other variables a subroutine might use must be protected from interacting with variables in other parts of the program, so they have effect only within the subroutine's own scope. This is accomplished by declaring them as my variables, as will be explained shortly.

Finally, most subroutines return their results via the return function. This can return a single scalar as in return $dna; in our subroutine addACGT, in a list of scalars as in return ($dna1, $dna2);, in an array as in return @lines;, and more.

6.2.1 Arguments

To call a subroutine means to type its name and give it appropriate arguments and, usually, collect its results. Arguments , sometimes called parameters, usually contain the data that the subroutine computes on. In Example 6-1, this is the call of the subroutine addACGT with the argument $dna:

$longer_dna = addACGT($dna);

The essential point is that whenever you, the programmer, want to use a subroutine, you can call it with whatever argument(s) it is designed to accept and with which you need to compute (in this case, whatever DNA that needs ACGT appended to it) and the value of each argument appears in the subroutine in the @_ array.

When you call a subroutine with certain arguments, the names of the arguments you provide in the call are not important inside the subroutine. Only the values of those arguments that are actually passed inside the subroutine are important. The subroutine typically collects the values from the @_ array and assigns them to new variables that may or may not have the same names as the variables with which you called the subroutine. The only thing preserved is the order of the values, not the names of the variables containing the values.

Here's how it works. The first line in the subroutine's block is:

my($dna) = @_;

The values of the arguments from the call of the subroutine are passed into the subroutine in the special array variable @_. You know it's an array because it starts with the @ character. It has the brief name "_", and it's a special array variable that comes predefined in Perl programs. (It's not a name you should pick for your own arrays.) The array @_ contains all the scalar values passed into the subroutine. These scalar values are the values of the arguments to the subroutine. In this case, there is one scalar value: the string of DNA that's the value of the variable $dna passed in as an argument.

If the subroutine has more arguments—for instance one argument for DNA, one for the associated protein, and one for the name of the gene—they are all passed in and assigned to my variables inside the subroutine:

my($dna,$protein,$name_of_gene) = @_;

If there are no arguments, just omit that statement in the subroutine.

After the statement:

my($dna) = @_;

executes in the subroutine, the passed-in value is assigned to the subroutine's variable $dna. The next section explains why this is a new variable specific to the subroutine. The subroutine's variable can be called anything; it certainly doesn't have to be the same name as the argument, as it happens to be in this example. What's cool about scoping is that it doesn't matter if it is or not.

Beware the common mistake of forgetting the @_ array when naming your arguments in a subroutine, that is, using the statement:


instead of:

my($dna) = @_;

If you make this mistake, the values of the arguments won't appear in your subroutine, even though their names are declared.

6.2.2 Scoping

By keeping all variables a subroutine uses active only within the subroutine, you can make it safe to call the subroutines from anywhere. You make the variables specific only to the subroutine by declaring them as myvariables. my is a keyword defined in Perl that limits variables to the block in which they are used (in this case, the block is the subroutine).[5]

[5] There are different models of scoping; my implements a type called lexical scoping, also known as static scoping. Another method is available in Perl via the local construct, but you almost always want to use my.

Hiding variables and making them local to only a restricted part of a program, is called scoping. In Perl, using my variables is known as lexical scoping, and it's an essential part of modularizing your programs.

You declare that a variable is a myvariable like this:


my $x ;

or, combining the declaration with an initialization to a value:

my($x) = '49';

or, if you're collecting an argument within a subroutine:

my($x) = @_;

Once a variable is declared in this fashion, it exists only until the end of the block it was declared in. So in a subroutine, if you declare all your variables like this (both the arguments and any other variables), they are active only in the subroutine. If any variable has the same name as another variable elsewhere in the program, you don't have to worry, because the my declaration actually creates a new variable, active only in the enclosing block, and any other variable of the same name used elsewhere outside the block is kept separate.

The example that showed collecting an argument in a subroutine uses parentheses around the variable. Because @_ is an array, the parentheses around the new variables put them in array context and ensure that they are initialized correctly (see Chapter 4).

Always declare all your variables in your subroutines—even those variables that don't come in as arguments—such as the my construct.

Why use scoping? Example 6-2 shows the trouble that can happen when you don't. Recall that one of the advantages of subroutines is writing a useful bit of code once and then using it whenever you need it. Example 6-2 is a program that has a variable in the main program with the same name as a variable in a subroutine it calls. This can easily happen if you write the subroutine at a time other than the main program (say six months later) or if you call a subroutine someone else wrote.

1   ...   6   7   8   9   10   11   12   13   ...   28

The database is protected by copyright © 2017
send message

    Main page