Beginning Perl for Bioinformatics

Logical Operators and the Range Operator

Download 1.4 Mb.
Date conversion29.03.2017
Size1.4 Mb.
1   ...   15   16   17   18   19   20   21   22   ...   28

9.2.4 Logical Operators and the Range Operator

You're using a foreach loop to process the lines of the bionet file stored in the @rebasefile array.

Within that loop you use a new feature of Perl to skip the header lines, called the range operator (..), which is used in this line:

( 1 .. /Rich Roberts/ ) and next;

This has the effect of skipping everything from the first line up to and including the line with "Rich Roberts," in other words, the header lines. (Range operators must have at least one of their endpoints given as a number to work like this.)

The and function is a logical operator. Logical operators are available in most programming languages. In Perl they've become very popular, so although we haven't used them a great deal in this book, you'll often come across code that does. In fact, you'll start to see them a bit more as the book continues.

Logical operators can test if two conditions are both true, for instance:

if( $string eq 'kinase' and $num == 3) {



Only if both the conditions are true is the entire statement true.

Similarly, with logical operators you can test if at least one of the conditions is true using the or operator, for instance:

if( $string eq 'kinase' or $num == 3) {



Here, the if statement is true if either or both of the conditionals are true.

There is also the not logical operator, a negation operator with which you can test if something is false:

if( not 6 == 9 ) {



6 == 9 returns false, which is negated by the not operator, so the entire conditional returns true.

There are also the closely related operators, && for and
, || for or, and ! for not. These have slightly different behavior (actually, different precedence); most Perl code uses the versions I've shown, but both are common.

When in doubt about precedence, you can always parenthesize expressions to ensure your statement means what you intend it to mean. (See Section 9.3.1 later in this chapter.)

Logical operators also have an order of evaluation, which makes them useful for controlling the flow of programs. Let's take a look at how the and operator evaluates its two arguments. It first evaluates the left argument, and if it's true, evaluates and returns the right. If the left argument evaluates to false, the right argument is never touched. So the and operator can act like a mini if statement. For instance, the following two examples are equivalent:

if( $verbose ) {

print $helpful_but_verbose_message;

$verbose and print $helpful_but_verbose_message;

Of course, the if statement is more flexible, because it allows you to easily add more statements to the block, and elsif and else conditions to their own blocks. But for simple situations, the and operator works well.[1]

[1] You can even chain logical operators one after the other to build up more complicated expressions and use parentheses to group them. Personally, I don't like that style much, but in Perl, there's more than one way to do it!

The logical operator or evaluates and returns the left argument if it's true; if the left argument doesn't evaluate to true, the or operator then evaluates and returns the right argument. So here's another way to write a one-line statement that you'll often see in Perl programs:

open(MYFILE, $file) or die "I cannot open file $file: $!";

This is basically equivalent to our frequent:

unless(open(MYFILE, $file)) {

print "I cannot open file $file\n";



Let's go back and take a look at the parseREBASE subroutine with the line:

( 1 .. /Rich Roberts/ ) and next;

The left argument is the range 1 .. /Rich Roberts/. When you're in that range of lines, the range operator returns a true value. Because it's true, the and boolean operator goes on to see if the value on the other side is true and finds the next function, which evaluates to true, even as it takes you back to the "next" iteration of the enclosing foreach loop. So if you're between the first line and the Rich Roberts line, you skip the rest of the loop.

Similarly, the line:

/^\s*$/ and next;

takes you back to the next iteration of the foreach if the left argument, which matches a blank line, is true.

The other parts of this parseREBASE subroutine have already been discussed, during the design phase.

9.2.5 Finding the Restriction Sites

So now it's time to write a main program and see our code in action. Let's start with a little pseudocode to see what still needs to be done:


# Get DNA




# Get the REBASE data into a hash, from file "bionet"



for each user query
If query is defined in the hash

Get positions of query in DNA

Report on positions, if any


You now need to write a subroutine that finds the positions of the query in the DNA. Remember that trick of putting a global search in a while loop from Example 5-7 and take heart. No sooner said than:

Given arguments $query and $dna

while ( $dna =~ /$query/ig ) {

save the position of the match

return @positions

When you used this trick before, you just counted how many matches there were, not what the positions were. Let's check the documentation for clues, specifically the list of built-in functions in the documentation. It looks like the pos function will solve the problem. It gives the location of the last match of a variable in an m//g search. Example 9-3 shows the main program followed by the required subroutine. It's a simple subroutine, given the Perl functions like pos that make it easy.

Example 9-3. Make restriction map from user queries


# Make restriction map from user queries on names of restriction enzymes
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my %rebase_hash = ( );

my @file_data = ( );

my $query = '';

my $dna = '';

my $recognition_site = '';

my $regexp = '';

my @locations = ( );

# Read in the file "sample.dna"

@file_data = get_file_data("sample.dna");

# Extract the DNA sequence data from the contents of the file "sample.dna"

$dna = extract_sequence_from_fasta_data(@file_data);

# Get the REBASE data into a hash, from file "bionet"

%rebase_hash = parseREBASE('bionet');

# Prompt user for restriction enzyme names, create restriction map

do {

print "Search for what restriction site for (or quit)?: ";

$query = ;

chomp $query;
# Exit if empty query

if ($query =~ /^\s*$/ ) {



# Perform the search in the DNA sequence

if ( exists $rebase_hash{$query} ) {
($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
# Create the restriction map

@locations = match_positions($regexp, $dna);

# Report the restriction map to the user

if (@locations) {

print "Searching for $query $recognition_site $regexp\n";

print "A restriction site for $query at locations:\n";

print join(" ", @locations), "\n";

} else {

print "A restriction site for $query is not in the DNA:\n";



print "\n";

} until ( $query =~ /quit/ );



# Subroutine


# Find locations of a match of a regular expression in a string



# return an array of positions where the regular expression

# appears in the string

sub match_positions {

my($regexp, $sequence) = @_;
use strict;
use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare variables

my @positions = ( );

# Determine positions of regular expression matches


while ( $sequence =~ /$regexp/ig ) {

push ( @positions, pos($sequence) - length($&) + 1);

return @positions;


Here is some sample output from Example 9-3:

Search for what restriction enzyme (or quit)?: AceI

Searching for AceI G^CWGC GC[AT]GC

A restriction site for AceI at locations:

54 94 582 660 696 702 840 855 957

Search for what restriction enzyme (or quit)?: AccII

Searching for AccII CG^CG CGCG

A restriction site for AccII at locations:


Search for what restriction enzyme (or quit)?: AaeI

A restriction site for AaeI is not in the DNA:
Search for what restriction enzyme (or quit)?: quit

Notice the length($&) in the subroutine match_positions. That $& is a special variable that's set after a successful regular-expression match. It stands for the sequence that matched the regular expression. Since pos gives the position of the first base following the match, you have to subtract the length of the matching sequences, plus one (to make the bases start at position 1 instead of position 0) to report the starting position of the match. Other special variables include $` which contains everything in the string before the successful match; and $´, which contains everything in the string after the successful match. So, for example: '123456' =~ /34/ succeeds at setting these special variables like so: $`= '12', $& = '34', and $´ = '56'.

What we have here is admittedly bare bones, but it does work. See the exercises at the end of the chapter for ways to extend this code.

9.3 Perl Operations

We've made it pretty far in this introductory programming book without talking about basic arithmetic operations, because you haven't really needed much more than addition to increment counters.

However, an important part of any programming language, Perl included, is the ability to do mathematical calculations. Look at Appendix B, which shows the basic operations available in Perl.

9.3.1 Precedence of Operations and Parentheses

Operations have rules of precedence. These enable the language to decide which operations should be done first when there are a few of them in a row. The order of operations can change the result, as the following example demonstrates.

Say you have the code 8 + 4 / 2. If you did the division first, you'd get 8 + 2, or 10.However, if you did the addition first, you'd get 12 / 2, or 6.

Now programming languages assign precedences to operations. If you know these, you can write expressions such as 8 + 4 / 2, and you'd know what to expect. But this is a slippery slope.

For one thing, what if you get it wrong? Or, what if someone else has to read the code who doesn't have the memorization powers you do? Or, what if you memorize it for one language and Perl does it differently? (Different languages do indeed have different precedence rules.)

There is a solution, and it's called using parentheses. For Example 9-3, if you simply add parentheses: (8 + ( 4 / 2 )), it's clear to you, other readers, and the Perl program, that you want to do the division first. Note that "inner" parentheses, contained within another pair of parentheses, are evaluated first.

Remember to use parentheses in complicated expressions to specify the order of operations. Among other things, it will save you some long debugging sessions!

9.4 Exercises

Exercise 9.1

Modify Example 9-3 to accept DNA from the command line; if it's not specified there, prompt the user for a FASTA filename and read in the DNA sequence data.

Exercise 9.2

Modify Exercise 9.1 to read in, and make a hash of, the entire REBASE restriction site data from the bionet file.

Exercise 9.3

Modify Exercise 9.2 to store the REBASE hash created in a DBM file if it doesn't exist or to use the DBM file if it does exist. (Look ahead to Chapter 10 for more information about DBM.)

Exercise 9.4

Modify Example 5-3 to report on the locations of the motifs that it finds, even if motif appears multiple times in the sequence data.

Exercise 9.5

Include a graphic display of the cut sites in the restriction map by printing the sequence and labeling the recognition sites with the enzyme name. Can you make a map that handles multiple restriction enzymes? How can you handle overlapping restriction sites?

Exercise 9.6

Write a subroutine that returns a restriction digest, the fragments of DNA left after performing a restriction reaction. Remember to take into account the location of the cut site. (This requires you to parse the REBASE bionet in a different manner. You may, if you wish, ignore restriction enzymes that are not given with a ^ indicating a cut site.)

Exercise 9.7

Extend the restriction map software to take into account the opposite strand for nonpalindromic recognition sites.

Exercise 9.8

Given an arithmetic expression without parentheses, write a subroutine that adds the appropriate parentheses to conform to Perl's precedence rules. (Warning: this is a pretty hard exercise and should be skipped by all but the true believers who have extra time on their hands. See the Perl documentation for the precedence rules.)

Chapter 10. GenBank

GenBank (Genetic Sequence Data Bank) is a rapidly growing international repository of known genetic sequences from a variety of organisms. Its use is central to modern biology and to bioinformatics.

This chapter shows you how to write Perl programs to extract information from GenBank files and libraries. Exercises include looking for patterns; creating special libraries; and parsing the flat-file format to extract the DNA, annotation, and features. You will learn how to make a DBM database to create your own rapid-access lookups on selected data in a GenBank library.

Perl is a great tool for dealing with GenBank files. It enables you to extract and use any of the detailed data in the sequence and in the annotation, such as in the FEATURES table and elsewhere. When I first started using Perl, I wrote a program that searched GenBank for all sequence records annotated as being located on human chromosome 22. I found many genes where that information was so deeply buried within the annotation, that the major gene mapping database, Genome Database (GDB), hadn't included them in their chromosome map. I think you'll discover the same feeling of power over the information when you start applying Perl to GenBank files.

Most biologists are familiar with GenBank. Researchers can perform a search, e.g., a BLAST search on some query sequence, and collect a set of GenBank files of related sequences as a result. Because the GenBank records are maintained by the individual scientists who discovered the sequences, if you find some new sequence of interest, you can publish it in GenBank.

GenBank files have a great deal of information in them in addition to sequence data, including identifiers such as accession numbers and gene names, phylogenetic classification, and references to published literature. A GenBank file may also include a detailed FEATURES table that summarizes facts about the sequence, such as the location of the regulatory regions, the protein translation, and exons and introns.

GenBank is sometimes referred to as a databank or data store, which is different from a database. Databases typically have a relational structure imposed upon the data, including associated indices and links and a query language. GenBank in comparison is a flat file, that is, an ASCII text file that is easily readable by humans.[1]

[1] GenBank is also distributed in ASN.1 format, for which you need specialized tools, provided by NCBI.

From its humble beginnings GenBank has rapidly grown, and the flat-file format has seen signs of strain during the growth. With a quickly advancing body of knowledge, especially one that's growing as quickly as genetic data, it's difficult for the design of a databank to keep up. Several reworkings of GenBank have been done, but the flat-file format—in all its frustrating glory—still remains.

Due to a certain flexibility in the content of some sections of a GenBank record, extracting the information you're looking for can be tricky. This flexibility is good, in that it allows you to put what you think is most important into the data's annotation. It's bad, because that same flexibility makes it harder to write programs that to find and extract the desired annotations. As a result, the trend has been towards more structure in the annotations.

Since Perl's data structures and its use of regular expressions make it a good tool for manipulating flat files, Perl is especially well-suited to deal with GenBank data. Using these features in Perl and building on the skills you've developed from previous chapters, you can write programs to access the accumulated genetic knowledge of the scientific community in GenBank.

Since this is a beginning book that requires no programming experience, you should not expect to find the most finished, multipurpose software here. Instead you'll find a solid introduction to parsing and building fast lookup tables for GenBank files. If you've never done so, I strongly recommend you explore the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) ( While you're at it, stop by the European Bioinformatics Institute (EBI) at and the bioinformatics arm of the European Molecular Biology Laboratory (EMBL) at These are large, heavily funded governmental bioinformatics powerhouses, and they have (and distribute) a great deal of state-of-the-art bioinformatics software.

10.2 GenBank Libraries

GenBank is distributed as a set of libraries—flat files containing many records in succession.[2] As of GenBank release 125.0, August 2001, there are 243 files, most of which are over 200 MB in size. Altogether, GenBank contains 12,813516 loci and 13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually distributed compressed, which means you can download somewhat smaller files, but you need to uncompress them after you received them. Uncompressed, this amounts to about 50 GB of data. Since 1982, the number of sequences in GenBank has doubled about every 14 months.

[2] The data is also distributed in the ASN.1 format.

GenBank libraries are further organized into divisions by the classification of the sequences they contain, either phylogenetically or by sequencing technology. Here are the divisions:

  • PRI: primate sequences

  • ROD: rodent sequences

  • MAM: other mammalian sequences

  • VRT: other vertebrate sequences

  • INV: invertebrate sequences

  • PLN: plant, fungal, and algal sequences

  • BCT: bacterial sequences

  • VRL: viral sequences

  • PHG: bacteriophage sequences

  • SYN: synthetic and chimeric sequences

  • UNA: unannotated sequences

  • EST: EST sequences (expressed sequence tags)

  • PAT: patent sequences

  • STS: STS sequences (sequence tagged sites)

  • GSS: GSS sequences (genome survey sequences)
  • HTG: HTGS sequences (high throughput genomic sequencing data)

  • HTC: HTC sequences (high throughput cDNA sequencing data)

Some divisions are very large: the largest, the EST, or expressed sequence tag division, is comprised of 123 library files! A portion of human DNA is stored in the PRI division, which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. Human data is also stored in the STS, GSS, HTGS, and HTC divisions. Human data alone in GenBank makes up almost 5 million record entries with over 8 trillion bases of sequence.

The public database servers such as Entrez or BLAST at give you access to well-maintained and updated sequence data and programs, but many researchers find that they need to write their own programs to manipulate and analyze the data. The problem is, there's so much data. For many purposes, you can download a selected set of records from NCBI or other locations, but sometimes you need the whole dataset.

It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data onto your hard drive, however, is more difficult. A Perl program called helps to address this need. Downloading it, even with a university-standard, high-speed Internet connection can be time-consuming; downloading an entire dataset with a modem can be an exercise in frustration. The best solution is to download only the files you need, in compressed form. The EST data, for example, is about half the entire database; don't download it unless you really need to. If you need to download GenBank, I recommend contacting the help desk at NCBI. They'll help you get the most up-to-date information.

Since you're learning to program, it makes more sense to practice on a tiny, five-record library file, but the programs you'll write will work just fine on the real files.

10.3 Separating Sequence and Annotation

In previous chapters you saw how to examine the lines of a file using Perl's array operations. Usually, you do this by saving the data in an array with each appearing as an element of the array.

Let's look at two methods to extract the annotation and the DNA from a GenBank file. In the first method, you'll slurp the file into an array and look through the lines, as in previous programs. In the second, you'll put the whole GenBank record into a scalar variable and use regular expressions to parse the information. Is one approach better than the other? Not necessarily: it depends on the data. There are advantages and disadvantages to each, but both get the job done.

I've put five GenBank records in a file called As before, you can download the file from this book's web site. You'll use this datafile and the file in the next few examples.

10.3.1 Using Arrays

Example 10-1 shows the first method, which operates on an array containing the lines of the GenBank record. The main program is followed by a subroutine that does the real work.

Example 10-1. Extract annotation and sequence from GenBank file


# Extract annotation and sequence from GenBank file
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# declare and initialize variables

my @annotation = ( );

my $sequence = '';

my $filename = '';

parse1(\@annotation, \$sequence, $filename);

# Print the annotation, and then

# print the DNA in new format just to check if we got it okay.

print @annotation;
print_sequence($sequence, 50);

# Subroutine

# parse1


# --parse annotation and sequence from GenBank record

sub parse1 {
my($annotation, $dna, $filename) = @_;
# $annotation--reference to array

# $dna --reference to scalar

# $filename --scalar

# declare and initialize variables

my $in_sequence = 0;

my @GenBankFile = ( );

# Get the GenBank data into an array from a file

@GenBankFile = get_file_data($filename);

# Extract all the sequence lines

foreach my $line (@GenBankFile) {

if( $line =~ /^\/\/\n/ ) { # If $line is end-of-record line //\n,

last; #break out of the foreach loop.

} elsif( $in_sequence) { # If we know we're in a sequence,

$$dna .= $line; # add the current line to $$dna.

} elsif ( $line =~ /^ORIGIN/ ) { # If $line begins a sequence,

$in_sequence = 1; # set the $in_sequence flag.

} else{ # Otherwise

push( @$annotation, $line); # add the current line to @annotation.



# remove whitespace and line numbers from DNA sequence

$$dna =~ s/[\s0-9]//g;


Here's the beginning and end of Example 10-1's output of the sequence data:












The foreach loop in subroutine parse1 in Example 10-1 moves one by one through the lines from the GenBank file stored in the array @GenBankFile. It takes advantage of the structure of a GenBank file, which begins with annotation and runs until the line:


is found, after which sequence appears until the end-of-record line // is reached. The loop uses a flag variable $in_sequence to remember that it has found the ORIGIN line and is now reading sequence lines.

The foreach loop has a new feature: the Perl built-in function last, which breaks out of the nearest enclosing loop. It's triggered by the end-of-record line //, which is reached when the entire record has been seen.

A regular expression is used to find the end-of-record line. To correctly match the end-of-record (forward) slashes, you must escape them by putting a backslash in front of each one, so that Perl doesn't interpret them as prematurely ending the pattern. The regular expression also ends with a newline \/\/\n, which is then placed inside the usual delimiters: /\/\/\n/. (When you have a lot of forward slashes in a regular expression, you can use another delimiter around the regular expression and precede it with an m, thus avoiding having to backslash the forward slashes. It's done like so: m!//\n!).

An interesting point about subroutine parse1 is the order of the tests in the foreach loop that goes through the lines of the GenBank record. As you read through the lines of the record, you want to first gather the annotation lines, set a flag when the ORIGIN start-of-sequence line is found, and then collect the lines until the end-of-record // line is found.

Notice that the order of the tests is exactly the opposite. First, you test for the end-of-record line, collect the sequence if the $in_sequence flag is set, and then test for the start-of-sequence ORIGIN line. Finally, you collect the annotation.

The technique of reading lines one by one and using flag variables to mark what section of the file you're in, is a common programming technique. So, take a moment to think about how the loop would behave if you changed the order of the tests. If you collected sequence lines before testing for the end-of-record, you'd never get to the end-of-record test!

Other methods of collecting annotation and sequence lines are possible, especially if you go through the lines of the array more than once. You can scan through the array, keeping track of the start-of-sequence and end-of-record line numbers, and then go back and extract the annotation and sequence using an array splice (which was described in the parseREBASE subroutine in Example 9-2). Here's an example:

# find line numbers of ORIGIN and // in the GenBank record

$linenumber = 0;

foreach my $line (@GenBankFile) {

if ( $line =~ /^//\n/ ) { # end-of-record // line

$end = $linenumber;


} elsif ( $line =~ /^ORIGIN/ ) { # end annotation, begin sequence

$origin = $linenumber;



# extract annotation and sequence with "array splice"

@annotation = @GenBankFile[0..($origin-1)];

@sequence = @GenBankFile[($origin+1)..($end-1)];

1   ...   15   16   17   18   19   20   21   22   ...   28

The database is protected by copyright © 2017
send message

    Main page