Beginning Perl for Bioinformatics

Download 1.4 Mb.
Date conversion29.03.2017
Size1.4 Mb.
1   ...   14   15   16   17   18   19   20   21   ...   28

8.7 Exercises

Exercise 8.1

Write a subroutine that checks a string and returns true if it's a DNA sequence. Write another that checks for protein sequence data.

Exercise 8.2

Write a program that can search by name for a gene in an unsorted array.

Exercise 8.3

Write a program that can search by name for a gene in a sorted array; use the Perl sort function to sort an array. For extra credit: write a binary search subroutine to do the searching.

Exercise 8.4

Write a subroutine that inserts an element into a sorted array. Hint: use the splice Perl function to insert the element, as shown in Chapter 4.

Exercise 8.5

Write a program that searches by name for a gene in a hash. Get the genes from your own work or try downloading a list of all genes for a given organism from or one of the web sites given in Appendix A. Make a hash of all the genes (key=name, value=gene ID or sequence). Hint: you may have to write a short Perl program to reformat the list of genes you start with to make it easy to populate the Perl hash.

Exercise 8.6

Write a subroutine that checks an array of data and returns true if it's in FASTA format. Note that FASTA expects the standard IUB/IUPAC amino acid and nucleic acid codes, plus the dash (-) that represents a gap of unknown length. Also, the asterisk (*) represents a stop codon for amino acids. Be careful using an asterisk in regular expressions; use a \* to escape it to match an actual asterisk.

The remaining problems deal with the effect of mutations in DNA on the proteins they encode. They combine the subject of randomization and mutations from Chapter 7 plus the subject of the genetic code from this chapter.

Exercise 8.7

For each codon, make note of what effect single nucleotide mutations have on the codon: does the same amino acid result, or does the codon now encode a different amino acid? Which one? Write a subroutine that, given a codon, returns a list of all the amino acids that may result from any single mutation in the codon.

Exercise 8.8

Write a subroutine that, given an amino acid, randomly changes it to one of the amino acids calculated in Exercise 8.7.

Exercise 8.9

Write a program that randomly mutates the amino acids in a protein but restricts the possibilities to those that can occur due to a single mutation in the original codons, as in Exercises 8.7 and 8.8.

Exercise 8.10

Some codons are more likely than others to occur in random DNA. For instance, there are 6 of the 64 possible codons that code for the amino acid serine, but only 2 of the 64 codes for phenylalanine. Write a subroutine that, given an amino acid, returns the probability that it's coded by a randomly generated codon (see Chapter 7).

Exercise 8.11

Write a subroutine that takes as arguments an amino acid; a position 1, 2, or 3; and a nucleotide. It then takes each codon that encodes the specified amino acid (there may be from one to six such codons), and mutates it at the specified position to the specified nucleotide. Finally, it returns the set of amino acids that are encoded by the mutated codons.

Exercise 8.12

Write a program that, given two amino acids, returns the probability that a single mutation in their underlying (but unspecified) codons results in the codon of one amino acid mutating to the codon of the other amino acid.

Chapter 9. Restriction Maps and Regular Expressions

In this chapter, I'll give an overview of Perl regular expressions and Perl operators, two essential features of the language we've been using all along. We'll also investigate the programming of a standard, fundamental molecular-biology technique: the discovery of a restriction map for a sequence. Restriction digests were one of the original ways to "fingerprint" DNA; this can now be simulated on the computer.

Restriction maps and their associated restriction digests are common calculations in the laboratory and are provided by several software packages. They are essential tools in the planning of cloning experiments; they can be used to insert a desired stretch of DNA into a cloning vector, for instance. Restriction maps also find application in sequencing projects, for instance in shotgun or directed sequencing.

9.1 Regular Expressions

We've been dealing with regular expressions for a while now. This section fills in some background an.d ties together the somewhat scattered discussions of regular expressions from earlier parts of the book.

Regular expressions are interesting, important, and rich in capabilities. Jeffrey Friedl's book Mastering Regular Expressions (O'Reilly) is entirely devoted to them. Perl makes particularly good use of regular expressions, and the Perl documentation explains them well. Regular expressions are useful when programming with biological data such as sequence, or with GenBank, PDB, and BLAST files.

Regular expressions are ways of representing—and searching for—many strings with one string. Although they are not strictly the same thing, it's useful to think of regular expressions as a kind of highly developed set of wildcards. The special characters in regular expressions are more properly known as metacharacters.

Most people are familiar with wildcards, which are found in search engines or in the game of poker. You might find the reference to every word that starts with biolog by typing biolog*, for instance. Or you may find yourself holding five aces. (Different situations may use different wildcards. Perl regular expressions use * to mean "0 or more of the preceding item," not "followed by anything" as in the wildcard example just given.)

In computer science, these kinds of wildcards or metacharacters have an important history, both practically and theoretically. The asterisk character in particular is called the Kleene closure after the eminent logician who invented it. As a nod to the theory, I'll mention there is a simple model of a computer, less powerful than a Turing machine, that can deal with exactly the same kinds of languages that can be described by regular expressions. This machine model is called a finite state automaton. But enough theory for now.

We've already seen many examples that use regular expressions to find things in a DNA or protein sequence. Here I'll talk briefly about the fundamental ideas behind regular expressions as an introduction to some terminology. There is a useful summary of regular-expression features in Appendix B. Finally, we'll see how to learn more about them in the Perl documentation.

So let's start with a practical example that should be familiar by now to those who have been reading this text sequentially: using character classes to search DNA. Let's say there is a small motif you'd like to find in your library of DNA that is six basepairs long: CT followed by C or G or T followed by ACG. The third nucleotide in this motif is never A, but it can be C, G, or T. You can make a regular expression by letting the character class [CGT] stand for the variable position. The motif can then be represented by a regular expression that looks like this: CT[CGT]ACG. This is a motif that is six base pairs long with a C,G, or T in the third position. If your DNA was in a scalar variable $dna, you can test for the presence of the motif by using the regular expression as a conditional test in a pattern-matching statement, like so:

if( $dna =~ /CT[CGT]ACG/ ) {

print "I found the motif!!\n";


Regular expressions are based on three fundamental ideas:

Repetition (or closure)

The asterisk (*), also called Kleene closure or star, indicates 0 or more repetitions of the character just before it. For example, abc* matches any of these strings: ab, abc, abcc, abccc, abcccc, and so on. The regular expression matches an infinite number of strings.


In Perl, the pattern (a|b) (read: a or b) matches the string a or the string b.


This is a real obvious one. In Perl, the string ab means the character a followed by (concatenated with) the character b.

The use of parentheses for grouping is important: they are also metacharacters. So, for instance, the string (abc|def)z*x matches such strings as abcx, abczx, abczzx, defx, defzx, defzzzzzx, and so on. In English, it matches either abc or def followed by zero or more z's, and ending with an x. This example combines the ideas of grouping, alternation, closure, and concatenation. The real power of regular expressions is seen in this combining of the three fundamental ideas.

Perl has many regular-expression features. They are basically shortcuts for the three fundamental ideas we've just seen—repetition, alternation, and concatenation. For instance, the character class shown earlier can be written using alternation as (C|G|T). Another common feature is the period, which can stand for any character, except a newline. So ACG.*GCA stands for any DNA that starts with ACG and ends with GCA. In English, this reads as: ACG followed by 0 or more characters followed by GCA.

In Perl, regular expressions are usually enclosed within forward slashes and are used as pattern-matching specifiers. Check the documentation (or Appendix B), for m//, which includes some options that affect the behavior of the regular expressions. Regular expressions are also used in many of Perl's built-in commands, as you will see.

The Perl documentation is essential: start with the perlre section of the Perl manual at

9.2 Restriction Maps and Restriction Enzymes

One of the great discoveries in molecular biology, which paved the way for the current golden age in biological research, was the discovery of restriction enzymes. For the nonbiologist, and to help set up the programming material that follows, here's a short overview.

9.2.1 Background

Restriction enzymes are proteins that cut DNA at short, specific sequences; for example, the popular restriction enzymes EcoRI and HindIII are widely used in the lab. EcoRI cuts where it finds GAATTC, between the G and A. Actually, it cuts both complementary strands, leaving an overhang on each end. These "sticky ends" of a few bases in single strands make it possible for the fragments to re-form, making possible the insertion of DNA into vectors for cloning and sequencing, for instance. HindIII cuts at AAGCTT and cuts between the As. Some restriction enzymes cut in the middle and result in "blunt ends" with no overhang. About 1,000 restriction enzymes are known.

If you look at the reverse complement of the restriction enzyme EcoRI, you see it's GAATTC, the same sequence. This is a biological version of a palindrome, a word that reads the same in reverse. Many restriction sites are palindromes.

Computing restriction maps is a common and practical bioinformatics calculation in the laboratory. Restriction maps are computed to plan experiments, to find the best way to cut DNA to insert a gene, to make a site-specific mutation, or for several other applications of recombinant DNA techniques. By computing first, the laboratory scientist saves considerably on the necessary trial-and-error at the laboratory bench. Look for more about restriction enzymes at

We'll now write a program that does something useful in the lab: it will look for restriction enzymes in a sequence of DNA and report back with a restriction map of exactly where in the DNA the restriction enzymes appear.

9.2.2 Planning the Program

Back in Chapter 5, you saw how to look for regular expressions in text. So you've an idea of how to find motifs in sequences with Perl. Now let's think about how to use those techniques to create restriction maps. Here are some questions to ask:

Where do I find restriction enzyme data?

Restriction enzyme data can be found at the Restriction Enzyme Database, (REBASE), which is on the Web at

How do I represent restriction enzymes in regular expressions?

Exploring that site, you'll see that restriction enzymes are represented in their own language. We'll try to translate that language into the language of regular expressions.

How do I store restriction enzyme data?

There are about 1,000 restriction enzymes with names and definitions. This makes them candidates for the fast key-value type of lookup hashes provide. When you write a real application, say for the Web, it's a good idea to create a DBM file to store the information, ready to use when a program needs a lookup. I will cover DBM files in Chapter 10; here, I'll just demonstrate the principle. We'll keep only a few restriction enzyme definitions in the program.

How do I accept queries from the user?

You can ask for a restriction enzyme name, or you can allow the user to type in a regular expression directly. We'll do the first. Also, you want to let the user specify which sequence to use. Again, to simplify matters, you'll just read in the data from a sample DNA file.

How do I report back the restriction map to the user?

This is an important question. The simplest way is to generate a list of positions with the names of the restriction enzymes found there. This is useful for further processing, as it presents the information very simply.

But what if you don't want to do further processing; you just want to communicate the restriction map to the user? Then, perhaps it'd be more useful to present a graphical display, perhaps print out the sequence with a line above it that flags the presence of the enzymes.

There are lots of fancy bells and whistles you can use, but let's do it the simple way for now and output a list.

So, the plan is to write a program that includes restriction enzyme data translated into regular expressions, stored as the values of the keys of the restriction enzyme names. DNA sequence data will be used from the file, and the user will be prompted for names of restriction enzymes. The appropriate regular expression will be retrieved from the hash, and we'll search for all instances of that regular expression, plus their locations. Finally, the list of locations found will be returned.

9.2.3 Restriction Enzyme Data

The restriction enzyme data is available in a variety of formats, as a visit to the REBASE web site will show you. After looking around, you decide to get the information from the bionet file, which has a fairly simple layout. Here's the header and a few restriction enzymes from that file:

REBASE version 104 bionet.104


REBASE, The Restriction Enzyme Database

Copyright (c) Dr. Richard J. Roberts, 2001. All rights reserved.


Rich Roberts Mar 30 2001










AauI (Bsp1407I) T^GTACA












Acc16I (MstI) TGC^GCA




Acc65I (KpnI) G^GTACC

Acc113I (ScaI) AGT^ACT







AceI (TseI) G^CWGC









Your first task is to read this file and get the names and the recognition site (or restriction site) for each enzyme.To simplify matters for now, simply discard the parenthesized enzyme names.

How can this data be read?

Discard header lines

For each data line:
remove parenthesized names, for simplicity's sake
get and store the name and the recognition site
Translate the recognition sites to regular expressions

--but keep the recognition site, for printing out results

return the names, recognition sites, and the regular expressions

This is high-level undetailed pseudocode, so let's refine and expand it. (Notice that the curly brace isn't properly matched. That's okay, because there are no syntax rules for pseudocode; do whatever works for you!) Here's some pseudocode that discards the header lines:

foreach line
if /Rich Roberts/

break out of the foreach loop


This is based on the format of the file, in which the string you're looking for is the last text before the data lines start. (Of course, if the format of the file should change, this might no longer work.)

Now let's further expand the pseudocode, thinking how to do the tasks involved:

# Discard header lines

# This keeps reading lines, up to a line containing "Rich Roberts"

foreach line

if /Rich Roberts/

break out of the foreach loop

For each data line:
# Split the two or three (if there's a parenthesized name) fields

@fields = split( " ", $_);

# Get and store the name and the recognition site

$name = shift @fields;

$site = pop @fields;
# Translate the recognition sites to regular expressions

--but keep the recognition site, for printing out results


return the names, recognition sites, and the regular expressions

This isn't the translation, but let's look at what you've done.

First, you want to extract the name and recognition site data from a string. The most common way to separate words in a line of Perl, especially if the string is nicely formatted, is with the Perl built-in function split .

If you have two or three per line that have whitespace and are separated from each other by whitespace, you can get them into an array with the following simple call to split (which acts on the line as stored in the special variable @_.:

($name, $site) = split(" ")

The @fields array may have two or three elements depending on whether there was a parenthesized alternate enzyme named. But you always want the first and the last elements:

$name = shift@fields;

$site = pop@fields;

You now have the problem of translating the recognition site to a regular expression.

Looking over the recognition sites and having read the documentation on REBASE you found on its web site, you know that the cut site is represented by the caret (^). This doesn't help make a regular expression that finds the site in sequence, so you should remove it (see Exercise 9.6 in the Section 9.4 section).

Also notice that the bases given in the recognition sites are not just the bases A, C, G, and T, but they also use the more extended alphabet presented in Table 4-1. These additional letters include a letter for every possible group of two, three, or four bases. They're really like abbreviations for character classes in that respect. Aha! Let's write a subroutine that substitutes character classes for these codes, and then we'll have our regular expression.

Of course, REBASE uses them, because a given restriction enzyme might well match a few different recognition sites.

Example 9-1 is a subroutine that, given a string, translates these codes into character classes.

Example 9-1. Translate IUB ambiguity codes to regular expressions

# IUB_to_regexp


# A subroutine that, given a sequence with IUB ambiguity codes,

# outputs a translation with IUB codes changed to regular expressions


# These are the IUB ambiguity codes

# (Eur. J. Biochem. 150: 1-5, 1985):

# R = G or A

# Y = C or T

# M = A or C

# K = G or T

# S = G or C

# W = A or T

# B = not A (C or G or T)

# D = not C (A or G or T)

# H = not G (A or C or T)

# V = not T (A or C or G)

# N = A or C or G or T

sub IUB_to_regexp {
my($iub) = @_;
my $regular_expression = '';
my %iub2character_class = (

A => 'A',

C => 'C',

G => 'G',

T => 'T',

R => '[GA]',

Y => '[CT]',

M => '[AC]',

K => '[GT]',

S => '[GC]',

W => '[AT]',

B => '[CGT]',

D => '[AGT]',

H => '[ACT]',

V => '[ACG]',

N => '[ACGT]',

# Remove the ^ signs from the recognition sites

$iub =~ s/\^//g;

# Translate each character in the iub sequence

for ( my $i = 0 ; $i < length($iub) ; ++$i ) {


.= $iub2character_class{substr($iub, $i, 1)};


return $regular_expression;


It seems you're almost ready to write a subroutine to get the data from the REBASE datafile. But there's one important item you haven't addressed: what exactly is the data you want to return?

You plan to return three data items per line of the original REBASE file: the enzyme name, the recognition site, and the regular expression. This doesn't fit easily into a hash. You can return an array that stores these three data items in three consecutive slots. This can work: to read the data, you'd have to read groups of three items from the array. It's doable but might make lookup a little difficult. As you get into more advanced Perl, you'll find that you can create your own complex data structures.

Since you've learned about split, maybe you can have a hash in which the key is the enzyme name, and the value is a string with the recognition site and the regular expression separated by whitespace. Then you can look up the data fast and just extract the desired values using split. Example 9-2 shows this method.

Example 9-2. Subroutine to parse a REBASE datafile

# parseREBASE--Parse REBASE bionet file


# A subroutine to return a hash where

# key = restriction enzyme name

# value = whitespace-separated recognition site and regular expression

sub parseREBASE {
my($rebasefile) = @_;
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare variables

my @rebasefile = ( );

my %rebase_hash = ( );

my $name;

my $site;

my $regexp;

# Read in the REBASE file

@rebasefile = get_file_data($rebasefile);

foreach ( @rebasefile ) {
# Discard header lines

( 1 .. /Rich Roberts/ ) and next;

# Discard blank lines

/^\s*$/ and next;

# Split the two (or three if includes parenthesized name) fields

my @fields = split( " ", $_);

# Get and store the name and the recognition site
# Remove parenthesized names, for simplicity's sake,

# by not saving the middle field, if any,

# just the first and last

$name = shift @fields;

$site = pop @fields;
# Translate the recognition sites to regular expressions

$regexp = IUB_to_regexp($site);

# Store the data into the hash

$rebase_hash{$name} = "$site $regexp";

# Return the hash containing the reformatted REBASE data

return %rebase_hash;


This parseREBASE subroutine does quite a lot. Is there, however, too much in one subroutine; should it be rewritten? It's a good question to ask yourself as you're writing code. In this case, let's leave it as it is. However, in addition to doing a lot, it also does it in a few new ways, which we'll look at now.

1   ...   14   15   16   17   18   19   20   21   ...   28

The database is protected by copyright © 2017
send message

    Main page