Beginning Perl for Bioinformatics

On iteration $i of the loop!




Here's the output from Example 12-3:

On iteration 0 of the loop!


On iteration 1 of the loop!


In Example 12-3, a here document is placed in a for loop so that you can see the $i variable changing in the printout. Variables are interpolated into a here document in the same way they are interpolated into a double-quoted string. Each time through the loop, the contents of the here document undergo variable interpolation and are printed. The terminating string used in this example, HEREDOC, can be any string you specify. (There are several options for dealing with things like indentation; I won't discuss them here, and refer you to the Perl documentation.) Here documents are handy for some tasks, such as when you have a long, multiline document with just a few changes each time you print it. A business form letter, with only the addressee changed, is a typical example. Using a here document preserves the look of the final output in the code while still allowing variable interpolation.
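As a minimal sketch of the idea (not the book's own listing of Example 12-3), the loop below appends a here document to a string on each pass and prints the result at the end. The variable name $output is an assumption made for this illustration.

```perl
use strict;
use warnings;

# Build the output in a string so it can be inspected as well as printed.
my $output = '';

for my $i (0 .. 1) {
    # Everything up to the line containing only HEREDOC is the document;
    # $i is interpolated just as it would be in a double-quoted string.
    $output .= <<HEREDOC;
On iteration $i of the loop!
HEREDOC
}

print $output;
```

The terminating string must appear on a line by itself, starting in the first column, which is why HEREDOC is not indented with the rest of the loop body.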

12.5.3 format and write

Finally, let's take a look at the format and write functions. format is designed to generate reports and can handle page numbers, headers, and various layout options such as centering and left and right justification. It's modelled on the FORTRAN programming-language conventions for formatting and so is particularly handy for producing reports based on that style, such as the PDB file format, in which fields are specified as occupying certain columns on the line.

Example 12-4 is a short example of a format that creates a FASTA-style output.

Example 12-4. Example of format function to produce FASTA output


# Create fasta format DNA output with "format" function
use strict;
use warnings;

# Declare variables
my $id = 'A0000';
my $description = 'Highly weird DNA. This DNA is so unlikely!';

# Define the format
format STDOUT =
# The header line
>@<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...
$id, $description
# The DNA lines

# Print the fasta-formatted DNA output

Here's the output of Example 12-4:

>A0000      Highly weird DNA. This DNA is so...



After declaring and initializing the variables that fill in the form, the form is defined with:

format STDOUT =

and the format continues until it reaches the line with a period at the beginning.

The format is composed of three kinds of lines:

  • A comment beginning with the pound sign #

  • A picture line that specifies the layout of text

  • An argument line that names the variables that fill in the preceding picture line

The picture line and the argument line must be adjacent; they can't be separated by a comment line, for instance.

The first picture line/argument line combo is for the header information:

>@<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...

$id, $description

The picture line has two picture fields in it, associated with the variables $id and $description, respectively. The picture line begins with a greater-than sign, >, which is just text that begins each FASTA file header line, by definition. Then comes the first picture field, which is an @ sign followed by nine < signs. The @ sign declares a field that has the associated variable interpolated into it. The use of the nine less-than signs specifies that the value should be left-justified, for a total of 10 columns. If the value is bigger than 10 columns, it is truncated. A less-than sign left-justifies, a greater-than sign right-justifies, and a vertical bar | centers the data in the field.

The second picture field is almost identical. It is longer and ends with three dots (an ellipsis) which prints if the contents of the variable $description can't fit into the length of the picture field (which, in this case, is true.)
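To experiment with individual picture fields, one option is Perl's built-in formline function, which formats use internally and which collects its result in the special variable $^A. The following is a small sketch under that approach; the helper name fill_field and the square brackets around each field (added only to make the padding visible) are assumptions of this illustration, not part of Example 12-4.

```perl
use strict;
use warnings;

# Fill one picture line with the given values and return the text.
# fill_field is a hypothetical helper written for this sketch.
sub fill_field {
    my ($picture, @values) = @_;
    $^A = '';                      # clear the format accumulator
    formline($picture, @values);   # fill the picture with the values
    my $text = $^A;
    $^A = '';                      # leave the accumulator clean
    return $text;
}

# Ten columns each: an @ sign followed by nine justification signs.
my $left  = fill_field('[@<<<<<<<<<]', 'A0000');     # left-justified
my $right = fill_field('[@>>>>>>>>>]', 'A0000');     # right-justified
my $short = fill_field('[@<<<<]',      'ABCDEFGH');  # five columns, truncated

print "$left\n$right\n$short\n";
```

Single-quoted strings are used for the pictures so that the @ signs are never mistaken for array names.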

The next pair of picture/argument lines is:



The picture field starts with a caret, which declares a picture field that will handle variable-length records. The line also contains 49 less-than signs, for a total of 50 columns, left-justified. At the end are two tilde ~ signs, which indicate there should be additional lines for the data if it doesn't fit one on one line.

The write command simply prints the previously defined format. By default, the output goes to STDOUT, as is done in the example, but you can supply a filehandle to the format and write statements if you desire.
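Putting the pieces described above together, here is a complete, runnable sketch of a FASTA-style format. It follows the description in the text (a header picture line, then a caret field of 50 columns ending in two tildes, a line with a single period, and a call to write), but the DNA value is a made-up placeholder and the variable name $dna is an assumption, so treat it as an illustration rather than the book's listing.

```perl
use strict;
use warnings;

# Declare the variables that fill in the form
my $id          = 'A0000';
my $description = 'Highly weird DNA. This DNA is so unlikely!';
my $dna         = 'ACGT' x 30;    # hypothetical 120-base sequence

# Define the format; it ends at the line containing only a period
format STDOUT =
# The header line
>@<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...
$id, $description
# The DNA lines: a caret field, repeated by ~~ until $dna is used up
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$dna
.

# Print the fasta-formatted DNA output
write;
```

One thing to watch for: a caret field consumes the variable it prints. After write returns, $dna is empty, so format a copy if you need the sequence afterward.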

The upcoming release of Perl 6 will move formats out of the core of the language and make them into a module. Details are not available as of this writing, but this change will probably entail adding a statement such as use Formats; near the top of your code in order to load the module for using formats.

12.6 Bioperl

The Bioperl project is an important collection of Perl code for bioinformatics that has been in development since 1998. Although Bioperl uses the more advanced object-oriented style of Perl program design, it's possible to take an introductory look here at how it's organized and used.

The main focus of Bioperl modules is to perform sequence manipulation, provide access to various biology databases (both local and web-based), and parse the output of various programs.

Bioperl is available at the Bioperl web site. Some of its features rely on additional Perl modules, available from CPAN, being installed. This situation is quite common, and as you do more Perl programming, you'll become familiar with installing modules from CPAN. The Bioperl tutorials include information on installing Bioperl and additional modules for the three major operating systems: Unix or Linux, Mac, and Windows.

Bioperl doesn't provide complete programs. Rather, it provides a fairly large—and growing—set of modules for accomplishing common tasks, including some tasks you've seen in this book. You're responsible for writing the code that holds the modules together. By providing these ready and (usually) easy-to-use modules, Bioperl makes developing bioinformatics applications in Perl faster and easier. There are example programs for most of the modules, which can be examined and modified to get started.

Like many open source projects, Bioperl has suffered from fragmentation and uneven documentation, due to the strictly volunteer and geographically dispersed group of contributors. But recent work on the project leading up to Release 0.7 in March 2001 has significantly improved the project. In particular, there is now enough tutorial information on using the modules to enable you to make good use of the code.

Some difficulties still remain. Most of the code has been developed on Unix or Linux systems. Not all of it works on Macs or Windows operating systems, but most will. There are some documents available at the Bioperl web site that discuss using Bioperl on non-Unix computers, but the bottom line is that you might find that some things don't work.

If you're going to give Bioperl a try (and I strongly recommend you do), you should make sure you have a fairly recent version of Perl installed. You'll need at least Version 5.004; it would be much better to install the latest stable release from the Perl web site.

12.6.1 Sample Modules

To give you an idea of what tasks Bioperl can make easier for you, Table 12-1 displays a representative sample of some of the most useful modules available.

Table 12-1. Bioperl modules

  • Sequence object, with features
  • Multiple alignments held as a set of sequences
  • Generic species object
  • Database object interface to ACeDB servers
  • Database object interface to GDB HTTP query
  • Database object interface to GenBank
  • Database object interface to GenPept
  • A collection of routines useful for queries to NCBI databases
  • Database object interface to SWISS-PROT retrieval
  • Interface for indexing FASTA files
  • Interface for indexing GenBank seq files, that is, flat files in GenBank format
  • Implementation of a simple location on a sequence
  • Implementation of a location on a sequence that has multiple locations
  • Holds pair feature information, e.g., BLAST hits
  • Generic SeqFeature
  • Sequence feature based on similarity
  • Sequence feature based on the similarity of two sequences
  • Feature representing an exon
  • Feature representing an arbitrarily complex structure of a gene
  • Feature representing a transcript
  • Interface for a feature representing a transcript of exons, promoter, UTR, and a poly-adenylation site
  • Bioperl BLAST sequence analysis object
  • Lightweight BLAST parser for pair-wise sequence alignment using the BLAST algorithm
  • Lightweight BLAST parser
  • Lightweight BLAST parser for PSIBLAST reports
  • Bioperl codon table object
  • Bioperl FASTA utility object
  • Generates unique seq objects from an ambiguous seq object
  • Bioperl object for a restriction endonuclease object
  • Bioperl object for a sequence pattern or motif
  • Object holding statistics for one particular sequence
  • Object holding n-mer statistics for one sequence
  • Bioperl BLAST high-scoring segment pair object
  • Bioperl utility module for HTML-formatting BLAST reports
  • Bioperl BLAST "hit" object
  • Bioperl module for running BLAST analyses locally
  • Bioperl module for running BLAST analyses using an HTTP interface
  • Predicted exon feature
  • Predicted gene structure feature
  • Sequence change class for polypeptides
  • Point mutation and codon information from single amino acid changes
  • Sequence object with allele-specific attributes
  • DNA-level mutation class
  • Handler for sequence variation I/O formats

12.6.2 Bioperl Tutorial Script

Bioperl has a tutorial script to help you try out various parts of the package. In this section, I'll show how to start up and run some example computations.

I've mentioned already that you should learn how to download code from CPAN in order to add modules such as Bioperl. A great deal of the usefulness of the Perl programming environment now resides in these modules available on CPAN. This was a design decision: by concentrating on the core Perl language, the Perl designers can focus on making the language as good as they can. The Perl module developers can then concentrate on their many modules. By all means, take a look around the CPAN web site for an idea of the wealth of Perl modules available to you.

I won't give the details of how to install Bioperl here: as mentioned, they are available at the Bioperl web site, or you can visit the CPAN web site for information.

So, let's assume you've installed the Bioperl module and looked over the tutorial at the Bioperl web site. Now, let's see how to try out some Bioperl programs.

Go to the directory where the Bioperl software has been built on your system. For instance, on my Linux computer, I put the download file bioperl-0.7.0.tar.gz into the directory /usr/local/src, and then unpacked it with the command:

tar xvzf bioperl-0.7.0.tar.gz

which creates the source directory /usr/local/src/bioperl-0.7.0. After installing the module (check the documentation), you're ready to run the tutorial script.

Change to the source directory and run the tutorial script with perl. Here's the result (I've shown the head of the tutorial to give the author and copyright information):

% head

# $Id: ch12,v 1.44 2001/10/10 20:37:42 troutman Exp mam $
=head1 BioPerl Tutorial
Cared for by Peter Schattner
Copyright Peter Schattner
This tutorial includes "snippets" of code and text from various
Bioperl documents including module documentation, example scripts

% perl
The following numeric arguments can be passed to run the corresponding demo-script.
1 => access_remote_db ,
2 => index_local_db ,
3 => fetch_local_db , (# NOTE: needs to be run with demo 2)
4 => sequence_manipulations ,
5 => seqstats_and_seqwords ,
6 => restriction_and_sigcleave ,
7 => other_seq_utilities ,
8 => run_standaloneblast ,
9 => blast_parser ,
10 => bplite_parsing ,
11 => hmmer_parsing ,
12 => run_clustalw_tcoffee ,
13 => run_psw_bl2seq ,
14 => simplealign_univaln ,
15 => gene_prediction_parsing ,
16 => sequence_annotation ,
17 => largeseqs ,
18 => liveseqs ,
19 => demo_variations ,
20 => demo_xml ,

In addition the argument "100" followed by the name of a single
bioperl object will display a list of all the public methods
available from that object and from what object they are inherited.

Using the parameter "0" will run all tests.
Using any other argument (or no argument) will run this display.

So typical command lines might be:
To run all demo scripts:
> perl -w 0
or to just run the local indexing demos:
> perl -w 2 3
or to list all the methods available for object Bio::Tools::SeqStats -
> perl -w 100 Bio::Tools::SeqStats


Now let's try option 9, the BLAST parser, and option 1, access_remote_db. So here goes, starting with the BLAST parser:

% perl 9
Beginning parser example...
QUERY NAME : gi|1401126
LENGTH : 504
FILE : t/
DATE : Thu, 16 Apr 1998 18:56:18 -0400
VERSION : 2.0.4 [Feb-24-1998]
DB-NAME : Non-redundant GenBank+EMBL+DDBJ+PDB sequences
DB-RELEASE : Apr 16, 1998 9:38 AM
DB-LETTERS : 677679054

LAMBDA, K, H : 0.270, 0.0470, 0.230 (SHARED STATS)
S : 42, 74 (SHARED STATS)

Number of hits is 4
Fraction identical for hit 1 is 0.25
Sequence identities for hsp of hit 1 are 66-68 70 73 76 79 80 87-89 114 117
119 131 144 146 149 150 152 156 162 165 168 170 171 176 178-182 184 187 190
191 205-207 211 214 217 222 226 241 244 245 249 256 266-268 270 278 284 291
296 304 306 309 311 316 319 324


This is an interesting way to parse BLAST output! Now let's look at the access of the remote DB:

% perl 1
Beginning remote database access example...
seq1 display id is MUSIGHBA1
seq2 display id is AF303112
Display id of first sequence in stream is AF041456


Well, that was less informative as an output, but it seems you can infer that the remote DB access was successful. (By the way, if you're unsuccessful with this, it may be that you're behind a firewall which is denying access—a not uncommon occurrence in universities or large companies.)

The documentation suggests running the script under the Perl debugger to watch what happens step by step. I concur with that suggestion but won't include the output here. Try it yourself!

Since that last example wasn't much fun, let's try one more: here's the sequence manipulation tutorial:

% perl 4
Beginning sequence_manipulations and SeqIO example...
First sequence in fasta format...

Seq object display id is Test1

Sequence from 5 to 10 is TTTCAT
Acc num is unknown
Moltype is dna
Primary id is Test1
Truncated Seq object sequence is TTTCAT
Reverse complemented sequence 5 to 10 is GTGCTA
Translated sequence 6 to 15 is LQRAICLCVD

Beginning 3-frame and alternate codon translation example...
ctgagaaaataa translated using method defaults : LRK*
ctgagaaaataa translated as a coding region (CDS): MRK

Translating in all six frames:
frame: 0 forward: LRK*
frame: 0 reverse-complement: LFSQ
frame: 1 forward: *ENX
frame: 1 reverse-complement: YFLX
frame: 2 forward: EKI
frame: 2 reverse-complement: IFS

Translating with all codon tables using method defaults:
1 : LRK*
2 : L*K*
3 : TRK*
4 : LRK*
5 : LSK*
6 : LRKQ
9 : LSN*
10 : LRK*
11 : LRK*
12 : SRK*
13 : LGK*
14 : LSNY
15 : LRK*
16 : LRK*
21 : LSN*


That was more fun, because this part of Bioperl is doing several things we've done in this book.
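To make the connection concrete, here is a plain-Perl sketch (it does not use Bioperl, and it is not how Bioperl implements these operations) that reproduces two lines of the output above: the frame 0 forward and frame 0 reverse-complement translations of ctgagaaaataa under the standard genetic code. The subroutine names revcom and translate are choices made for this sketch; incomplete trailing codons are simply ignored.

```perl
use strict;
use warnings;

# Standard genetic code, listed in the conventional TCAG order:
# the first base changes slowest, the third base fastest.
my $amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my @bases       = qw(T C A G);

# Build the codon => amino acid lookup table
my %codon_table;
my $index = 0;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = substr($amino_acids, $index++, 1);
        }
    }
}

# Reverse complement of a DNA string
sub revcom {
    my ($dna) = @_;
    my $revcom = reverse uc $dna;
    $revcom =~ tr/ACGT/TGCA/;
    return $revcom;
}

# Translate complete codons starting at the first base;
# unknown codons become X.
sub translate {
    my ($dna) = @_;
    $dna = uc $dna;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = substr($dna, $i, 3);
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    return $protein;
}

my $dna = 'ctgagaaaataa';
print "forward: ", translate($dna), "\n";
print "reverse-complement: ", translate(revcom($dna)), "\n";
```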

I hope this brief look at Bioperl has whetted your appetite for more. It's a good idea to explore this set of modules. A Perl module for parsing BLAST output may also be of interest: it's now part of the Bioperl project.

12.7 Exercises

Exercise 12.1

Basic string matching. Write a program that looks for a query string in a target string. For instance, if the query string is "gone", it finds a match at position 22 of the target string "goof through the way-gone-osphere." Don't use regular expressions or any of Perl's built-in string-matching abilities; instead, examine individual positions in the strings, compare characters, and invent your own algorithm.

Exercise 12.2

Explore the NCBI BLAST web pages. Familiarize yourself with the purpose and use of the various component programs and read the tutorial information on the meaning of the statistics.

Exercise 12.3

Explore the Bioperl web pages. Download the code and install it on your computer.

Exercise 12.4

Perform BLAST searches at the NCBI web site. Search with DNA against DNA databases; then search with the same DNA against protein databases, and compare the output.

Exercise 12.5

Perform two BLAST searches with related sequences. Parse the BLAST output of the searches and extract the top 10 hits in the header annotation of each search. Write a program that reports on the differences and similarities between the two searches.

Exercise 12.6

Write a program that uses Bioperl to perform a BLAST search at the NCBI web site, then use Bioperl to parse the BLAST output.

Exercise 12.7

Using Bioperl modules mixed with your own code, write a program that runs BLAST on a set of DNA sequences and saves the IDs of the list of hits of each BLAST run sorted in arrays. Allow the user to view each list, to view hits in common between multiple lists and hits unique to one of multiple lists. For each hit, enable the user to fetch its entire GenBank record.

Exercise 12.8

Write an explanation of the code for the subroutine extract_HSP_information. Be sure to refer to the format of the data the code uses as input.

Chapter 13. Further Topics

This book's goal has been to help you learn basic Perl programming. In this chapter, I will point the way to further learning in Perl.

13.1 The Art of Program Design

My emphasis on the art of program design has determined the way in which the programs were presented. They've generally progressed from a discussion of problems and ideas, to pseudocode, to small groups of small, cooperating subroutines, and finally to a close-up discussion of the code. At several points you've seen more than one way to do the same task. This is an important part of a programmer's mindset: the knowledge of, and willingness to try, alternatives.

The other recurrent theme has been to explain the problem-solving strategies programmers rely on. These include knowing how to use such sources of information as searchable newsgroup archives, books, and language documentation; having a good working knowledge of debugging tools; and understanding basic algorithm and data structure design and analysis.

As your skills improve, and your programs become more complex, you'll find that these strategies take on a much more important role. Designing and coding programs to solve complex problems or crunch lots of complex data requires advanced problem-solving strategies. So it's worth your while to learn to think like a computer scientist as well as a biologist.

13.2 Web Programming

The Internet is the most important source of bioinformatics data. From FTP sites to web-enabled programs, the Perl-literate bioinformatician needs to be able to access web resources. Just about every lab has to have its own web page these days, and many grants even require it. You'll need to learn the basics about the HTML and XML markup languages that display web pages, about the difference between a web server and a web browser, and similar facts of life.

A popular Perl module makes it fairly easy to create interactive web pages, and several other modules are available that make Internet programming tasks relatively painless. For instance, you can write code for your own web page that enables visitors to try out your latest sequence analyzer or search through your special-purpose database. You can also add code to your own programs to enable them to interact with other web sites, querying and retrieving data automatically. Collaborators who are geographically dispersed can use such web programming to work cooperatively on a project.

13.3 Algorithms and Sequence Alignment

You will want to spend some time exploring the standard results in algorithms, as found in the texts recommended in Appendix A. A good place to start is the basic sequence alignment methods such as the Smith-Waterman algorithm. In terms of algorithms, the topics of parallelization, randomization, and approximation deserve at least a nodding acquaintance.

Sequence alignment is the subset of string matching algorithms used to find the extent of identity or similarity, or to find evidence of homology, between sequences. The Smith-Waterman algorithm, the treatment of gaps, the use of preprocessing, parallel techniques, the alignment of multiple sequences, and more are facets of this study.
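As a starting point, here is a minimal sketch of the Smith-Waterman local alignment score, computed by dynamic programming one row at a time. It returns only the best score, with no traceback of the alignment itself. The scoring values (match +1, mismatch -1, gap -1) and the subroutine name are assumptions chosen for illustration; real applications use substitution matrices and more careful gap penalties.

```perl
use strict;
use warnings;

# Best local alignment score between two strings (score only).
# Default scores are illustrative assumptions, not recommendations.
sub smith_waterman_score {
    my ($s, $t, $match, $mismatch, $gap) = @_;
    $match    = 1  unless defined $match;
    $mismatch = -1 unless defined $mismatch;
    $gap      = -1 unless defined $gap;

    my @a = split //, $s;
    my @b = split //, $t;

    my @prev = (0) x (scalar(@b) + 1);    # previous row of the score matrix
    my $best = 0;

    for my $i (1 .. scalar @a) {
        my @curr = (0);                   # first column is always zero
        for my $j (1 .. scalar @b) {
            my $diag = $prev[$j - 1] + ($a[$i - 1] eq $b[$j - 1] ? $match : $mismatch);
            my $up   = $prev[$j] + $gap;
            my $left = $curr[$j - 1] + $gap;

            # A local alignment may restart anywhere, so no cell drops below zero
            my $cell = 0;
            for my $candidate ($diag, $up, $left) {
                $cell = $candidate if $candidate > $cell;
            }
            $curr[$j] = $cell;
            $best = $cell if $cell > $best;
        }
        @prev = @curr;
    }
    return $best;
}

print smith_waterman_score('TTACGTTT', 'GGACGGG'), "\n";
```

Keeping only two rows at a time saves memory, at the cost of not being able to recover the alignment; storing the full matrix is the usual next step.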

13.4 Object-Oriented Programming

Object-oriented programming is a style of program design that provides a well-defined interface to data and subroutines (called methods in "OO-speak"). It's not hard to learn; it makes some things easy that would otherwise be hard (and vice versa, but you don't have to use it for everything!). A great deal of Perl code has been written in object-oriented style since the capability was added to the language a few years ago.
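For a first taste, here is a small sketch of a Perl class. The class name DNASeq and its methods are hypothetical, invented for this illustration; they are not Bioperl's interface. The data lives in a hash that only the methods touch, which is the "well-defined interface" in miniature.

```perl
use strict;
use warnings;

package DNASeq;    # hypothetical class name, used only for this sketch

# Constructor: store the data in a hash and bless it into the class
sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{id}, seq => uc $args{seq} };
    return bless $self, $class;
}

# Accessor methods
sub id  { my ($self) = @_; return $self->{id}; }
sub seq { my ($self) = @_; return $self->{seq}; }

# Return a new object holding the reverse complement
sub revcom {
    my ($self) = @_;
    my $revcom = reverse $self->{seq};
    $revcom =~ tr/ACGT/TGCA/;
    return DNASeq->new(id => $self->{id}, seq => $revcom);
}

package main;

my $seq = DNASeq->new(id => 'Test1', seq => 'tttcat');
print $seq->id, ' ', $seq->seq, "\n";
print 'Reverse complement: ', $seq->revcom->seq, "\n";
```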

13.5 Perl Modules

I've frequently mentioned modules. CPAN, the large collection of Perl code, has a huge number of modules you can use. Most are free, but do make a point of checking for copyright restrictions, and see the discussion in the Perl FAQs about copyright issues. These days, most modules, including the bulk of the code available on CPAN, are written in an object-oriented style. You'll need to extend your Perl knowledge to encompass this style, but you won't need an in-depth view of object-oriented techniques to use most modules in your programs.

13.5.1 Bioperl

An important and steadily developing suite of Perl modules for bioinformatics is the Bioperl project, which you can find at the Bioperl web site. These modules give you lots of capabilities, all ready to use.

13.6 Complex Data Structures

Perl can handle complex data structures. This is useful in many programming situations; it's also necessary to learn in order to read a lot of existing Perl code that might come your way.

For example, in this book, you've parsed a lot of data. To do so, you developed groups of subroutines, each fairly short, and each parsing different levels of the structure of the data. By using complex data structures, you can store your parse in a form that reflects the structure of the data. This, combined with object-oriented methods for accessing the parsed data, is a useful way to accomplish a parse.

Complex data structures depend on references, which I've touched on in discussions of call by reference and of File::Find.
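As a small sketch of what this looks like, the record below is a hash containing a reference to an array of hashes, one per feature. The field names and values are made up for illustration; they don't come from any particular file format.

```perl
use strict;
use warnings;

# A hypothetical parsed record: a hash whose values include a reference
# to an array of hashes (one hash per feature).
my %record = (
    id       => 'A0000',
    sequence => 'ACGTACGTAC',
    features => [
        { type => 'exon', start => 1, end => 4 },
        { type => 'CDS',  start => 5, end => 10 },
    ],
);

# Pass the whole structure around as a single reference
my $record_ref = \%record;

# Walk the nested structure, dereferencing as we go
for my $feature (@{ $record_ref->{features} }) {
    my $length = $feature->{end} - $feature->{start} + 1;
    my $subseq = substr($record_ref->{sequence}, $feature->{start} - 1, $length);
    print "$feature->{type}: $subseq\n";
}
```

The shape of the data structure mirrors the shape of the parsed data: one record, holding any number of features, each with its own named fields.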

13.7 Relational Databases

Relational databases are another area Perl programmers and bioinformaticians need to explore. There comes a time when flat files or DBM just won't do for managing the data of a medium- or large-sized project, and you must turn to relational databases. Although they take a bit more effort to set up and program, they offer a standard and reliable way to store data and ask questions about it. In this book, we briefly discussed relational databases and actually used a simple DBM database. In the course of your work, however, you're likely to encounter Oracle, MySQL, PostgreSQL, Sybase, and others.

The Perl module DBI, a database-independent interface for Perl, makes it possible to write code for manipulating relational databases that doesn't depend (too much) on which database you're actually using.

The fact is, writing code to handle databases isn't hard to do. The hardest part is making sure that the database is installed with the proper libraries, that the proper Perl modules are in place, and that you know how to connect to the database from your program. Once you have those things in place, using the database is generally easy.

That said, relational databases have their own lore, and there is a substantial body of knowledge about designing and managing good databases. Many programmers specialize in these issues, and that's true for plenty of bioinformaticians as well, since there are many interesting research questions related to designing better biological databases.

13.8 Microarrays and XML

Microarrays (miniaturized chip-based "laboratories" for studying gene expression) and XML (Extensible Markup Language) are two modern developments that are coming together. Now that whole genomes are available, microarray techniques enable you to measure the relative levels of thousands of gene transcripts at a time, and with their help, we hope to unravel the many pathways and interactions between the thousands of genes and gene products in the cell. XML is, to be painfully brief, a kind of new and improved HTML that is emerging as a standard for storing and interchanging data. (This book was written making extensive use of XML.) XML is becoming an important interface to many new kinds of experimental data.

13.9 Graphics Programming

Good graphical representation of data is critical for making your results useful to your colleagues. Graphics programming lets you present data and results, and interact with software applications, via attractive and easy-to-navigate interfaces. Many bioinformatics programs deal with large amounts of data, and a graphical user interface (GUI) can mean the difference between an application that helps you do your work and one that wastes your time. GUIs such as those commonly found on web pages are important not only for the display of output but also for the collection of user input.

The point-and-click method of interacting with software applications is a basic standard. A good GUI makes an application or program much easier to use. One difficulty of GUIs and graphic data displays, however, is that they tend to be less portable than programs with simpler graphics. You may want to explore the graphics capabilities of such Perl modules as Tk and GD, among others.

13.10 Modeling Networks

Networks of interacting biological systems, such as genes and gene products, can be modeled and investigated using graph algorithms. Despite the similarity to the term "graphics," graph algorithms are a different entity based on the discrete mathematical field of graph theory. Algorithms on graphs and their many variants (such as Petri nets) can store and investigate the properties of biochemical pathways and intra- and intercellular signalling pathways, for example.
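As a sketch of the basic idea, a network can be stored as a hash of arrays (an adjacency list) and explored with a breadth-first search. The gene names and interactions below are invented for illustration, and the subroutine name reachable is a choice made for this sketch.

```perl
use strict;
use warnings;

# A hypothetical regulatory network: each gene maps to the list of
# genes it acts on directly.
my %pathway = (
    geneA => [ 'geneB', 'geneC' ],
    geneB => [ 'geneD' ],
    geneC => [ 'geneD' ],
    geneD => [],
    geneE => [ 'geneA' ],
);

# Breadth-first search: return every node reachable from $start,
# in the order visited.
sub reachable {
    my ($graph, $start) = @_;
    my %seen  = ($start => 1);
    my @queue = ($start);
    my @order;

    while (@queue) {
        my $node = shift @queue;
        push @order, $node;
        for my $next (@{ $graph->{$node} || [] }) {
            next if $seen{$next}++;    # skip nodes already queued
            push @queue, $next;
        }
    }
    return @order;
}

print join(' ', reachable(\%pathway, 'geneA')), "\n";
```

Here geneE is never reported when starting from geneA, because the edges are directed: geneE acts on geneA, not the other way around.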

13.11 DNA Computers

For the forward-thinking scientist, it is interesting and instructive to learn about new trends in computing such as DNA computers, optical computing, and quantum computing. DNA computers are especially fun. They use standard molecular biology laboratory techniques as a model of a general-purpose computer. They can implement algorithms, store data, and in general behave like a "real" computer. They are impractical as of this writing, but they are really fun to think about, and someday, who knows?

Appendix A. Resources

There is a wide array of resource material for Perl and for bioinformatics programming. This list is not at all exhaustive, but it includes those resources, both online and in print, that I think you may find interesting and useful as you expand your Perl programming repertoire.

A.2 Computer Science

Even though you're programming for biological applications, you'll often find yourself venturing into the realm of traditional computer science. Here are some published resources to help you find your way.

A.2.1 Algorithms

Mastering Algorithms with Perl; by Jon Orwant, Jarkko Hietaniemi, and John Macdonald; O'Reilly & Associates. The best book for noncomputer scientists who program in Perl.

Introduction to Algorithms; by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest; MIT Press and McGraw-Hill. This is a really good book on algorithms—in many ways, the best. It's one of the standard university texts (arguably the standard text) at both the graduate and undergraduate levels. It works well as a textbook and as a reference. Its target audience is computer-science students, so there is a fair amount of math included, but even nonmathematical programmers will find this book very helpful.

Fundamentals of Algorithmics, by Gilles Brassard and Paul Bratley, Prentice Hall. An easy overview of algorithmic techniques.

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology; by Dan Gusfield; Cambridge University Press. This book specializes in algorithms for strings, including such topics as sequence alignment. It's very detailed, but even so, not complete: this is a big field! The best single source on string algorithms, with lots of information about biological sequence similarity.

The following books are for advanced study.

The Design and Analysis of Computer Algorithms; by Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman; Addison-Wesley. This is the classic book on the science of algorithms.

Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes; by Frank Thomson Leighton; Morgan Kaufmann. A comprehensive and rigorous text and reference.

Randomized Algorithms, by Rajeev Motwani and Prabhakar Raghavan, Cambridge University Press. A clear, rigorous book.

A.2.2 Software Engineering

Software Engineering, Second Edition; by Ian Sommerville; Addison-Wesley. A good, general book that covers lots of important topics and generally avoids taking sides for or against competing styles.

A.2.3 Theory of Computer Science

Introduction to Automata Theory, Languages, and Computation, Second Edition; by John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman; Addison-Wesley. The classic text on computer-science theory.

Computers and Intractability: A Guide to the Theory of NP-Completeness, by Michael R. Garey and David S. Johnson, W.H. Freeman & Co. The classic, and superb, book on the topic.

A.2.4 General Programming

The Unix Programmers Manual, Steven V. Earhart, ed., Harcourt, Brace and Jovanovich School. This manual for Unix (whatever version of Unix) is a crash course in computer science with an emphasis on programming. The design of the interacting programs, and the concepts of pipes, redirection, processes, and so on, has been one of the great success stories of programming. This manual summarizes the system: Part I documents user programs; Parts II and III document the programming interface. The programmable shell, and the programs grep, awk, and sed were some of the primary inspirations for Perl.

The C Programming Language, by Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall PTR. C and C++ are important languages in bioinformatics, and this classic book teaches C. If you work through the book, attempting all the programming exercises, you'll have some excellent programming training.

Structure and Interpretation of Computer Programs; by Harold Abelson, Gerald Jay Sussman, and Julie Sussman; MIT Press. A really interesting book that looks deeply at programming in the context of learning a dialect of Lisp.

The Unix Programming Environment, by Brian W. Kernighan and Robert Pike, Prentice Hall. This book is fun, and it talks about good software design.

A.3 Linux

If you have a Linux system, you have all the source code for the entire system available (this is also true for some Unix systems). (If it's not installed, you can get it from the distribution CDs, from the web site, or from the web site of the company that produced your version of Linux.) This is a great resource. You can take a look at how any program is actually written, even the operating system. Now you're really getting into programming!

A.4 Bioinformatics

Bioinformatics is a relatively new discipline that's attracting a lot of attention, so the available resources are multiplying fairly quickly. Here are a few books and other resources to help get you started.

A.4.1 Books

Developing Bioinformatics Computer Skills, by Cynthia Gibas and Per Jambeck, O'Reilly & Associates. This is a really good book for beginners. It covers setting up a Linux workstation and the installation and use of many of the best, and least expensive, bioinformatics programs. It teaches how to use bioinformatics programs, not how to program. It's the most practical bioinformatics book available.

Introduction to Computational Biology: Maps, Sequences and Genomes; by Michael S. Waterman; CRC Press. This is a classic book with a predominantly statistical outlook.

Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition; edited by Andreas D. Baxevanis and B.F. Francis Ouellette; John Wiley & Sons. Includes chapters on a wide range of topics by several authors.

A.4.2 Governmental Organizations

Absolutely essential. The following web sites are for the most important government-sponsored bioinformatics organizations:

The National Center for Biotechnology Information (NCBI). The U.S. government center.

The European Molecular Biology Laboratory (EMBL). The European Union laboratory.

The European Bioinformatics Institute (EBI) of EMBL.

A.4.3 Conferences

Bioinformatics has long been a part of various biology conferences, for instance the Cold Spring Harbor conferences on sequencing. Now there are many conferences with such coverage, often under the heading of "genomics." Here are a few interesting conferences:

  • ISMB: Intelligent Systems for Molecular Biology, now in its ninth year

  • Bioinformatics Open Source Conference

  • RECOMB: Conference on Computational Molecular Biology

Appendix B. Perl Summary

This appendix summarizes those parts of the Perl programming language that will be most useful to you as you read this book. It is not a comprehensive summary of the Perl language. Remember that Perl is designed so that you don't need to know everything in order to use it. Source material for this appendix came from Programming Perl, Third Edition (O'Reilly & Associates).

B.1 Command Interpretation

The Perl programs in this book start with the line:

#!/usr/bin/perl -w

On Unix (or Linux) systems, the first line of a file can include the name of a program and some flags, which are optional. The line must start with #!, followed by the full pathname of the program (in our case, the Perl interpreter), followed optionally by a single group of one or more flags.

If the Perl program file is called myprogram and has executable permissions, you can type myprogram (or possibly ./myprogram, or the full or relative pathname for the program) to start the program running.

The Unix operating system starts the program specified in the command interpretation line and gives it as input the rest of the file after the first line. So, in this case, it starts the Perl interpreter and gives it the program in the file to run.

This is just a shortcut for typing:

/usr/bin/perl -w myprogram

at the command line.
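
As a minimal sketch of such a file (the name myprogram and the message it prints are only illustrative), the following program reports how it was started. On most Unix systems you would first give the file executable permissions, for instance with chmod +x myprogram.

```perl
#!/usr/bin/perl -w
# Hypothetical contents of the file "myprogram".
use strict;

# $0 holds the program name as it was invoked;
# $^X holds the pathname of the Perl interpreter that is running it.
my $report = "Program $0 is being run by " . $^X . "\n";

print $report;
```

Whether you start it as ./myprogram or as /usr/bin/perl -w myprogram, the same interpreter runs the same code; only the value of $0 may differ.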

B.2 Comments

A comment begins with a # sign and continues from there to the end of the same line. It is ignored by the Perl interpreter and is only there for programmers to read. A comment can include any text.
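
Here is a short illustration (the variable names are made up for this sketch). A comment can take up a whole line or follow a statement. Note one caveat: a # that appears inside a quoted string is part of the string and doesn't start a comment.

```perl
use strict;
use warnings;

# A comment on a line of its own.
my $count = 3;              # A comment after a statement.
my $label = "sample #1";    # The # inside the quotes belongs to the string.

print "$label has count $count\n";
```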

B.3 Scalar Values and Scalar Variables

A scalar value is a single item of data, like a string or a number.

B.3.1 Strings

Strings are scalar values and are written as text enclosed within single quotes, like so:

'This is a string in single quotes.'

or double quotes, such as:

"This is a string in double quotes."

A single-quoted string prints out exactly as written. With double quotes, you can include a variable in the string, and its value will be inserted or "interpolated." You can also include commands such as \n to represent a newline (see Table B-3):

$aside = '(or so they say)';

$declaration = "Misery\n $aside \nloves company.";

print $declaration;

This snippet prints out:

Misery

 (or so they say)

loves company.
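
To see the difference between the two kinds of quotes side by side, here is a small sketch (the variable names are only for illustration) that puts the same text in single and in double quotes:

```perl
use strict;
use warnings;

my $base = 'A';

my $single = 'Base: $base\n';    # no interpolation: $base and \n stay as typed
my $double = "Base: $base\n";    # $base becomes A, and \n becomes a newline

print $single, "\n";
print $double;
```

The first print shows the literal characters $base and \n; the second shows Base: A followed by a newline.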

B.3.2 Numbers

Numbers are scalar values that can be:

  • Integers

  • Floating-point (decimal)

  • Scientific (exponential) notation (3.13 x 10^23 or 313000000000000000000000)

  • Hexadecimal (base 16)

  • Octal (base 8)

  • Binary (base 2)
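
The following sketch shows one way to write each kind of number in a Perl program. The particular values and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $integer     = -4;
my $float       = 4.5326;
my $scientific  = 3.13E23;      # 3.13 x 10^23
my $hexadecimal = 0x1F;         # 31 in decimal
my $octal       = 0377;         # a leading 0 means octal: 255 in decimal
my $binary      = 0b1010;       # 10 in decimal
my $readable    = 1_000_000;    # underscores are ignored in numeric literals

print "$hexadecimal $octal $binary $readable\n";    # prints: 31 255 10 1000000
```

However a number is written in the program, Perl stores its value and prints it in decimal unless you ask for another format.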


Complex (or imaginary) numbers, such as 3 + i, and fractions (or ratios, or rational numbers), such as 1/3, can be a little tricky. Perl can handle fractions but converts them internally to floating-point numbers, which can make certain operations go wrong (Perl is not alone among computer languages in this regard):

if ( 10/3 == ( (1/3) * 10 ) ) {

    print "Success!";

} else {

    print "Failure!";

}

This prints:

Failure!

To properly handle rational arithmetic with fractions, complex numbers, or many other mathematical constructs, there are mathematics modules available, which aren't covered here.
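
As a hedged sketch of two common workarounds: you can compare floating-point numbers within a small tolerance, or you can keep fractions exact with a module such as Math::BigRat, which is part of the standard Perl distribution. The subroutine name nearly_equal and the tolerance 1e-9 are assumptions made for this example, not something defined elsewhere in this book.

```perl
use strict;
use warnings;
use Math::BigRat;    # exact rational arithmetic

# Approach 1: treat two floating-point numbers as equal if they differ
# by less than a small tolerance (1e-9 is an arbitrary choice).
sub nearly_equal {
    my ($x, $y, $tolerance) = @_;
    $tolerance = 1e-9 unless defined $tolerance;
    return abs($x - $y) < $tolerance;
}

my $float_verdict = nearly_equal(10/3, (1/3) * 10) ? 'Success!' : 'Failure!';
print "$float_verdict\n";

# Approach 2: keep the fractions exact, so no rounding ever occurs.
my $third       = Math::BigRat->new('1/3');
my $ten_thirds  = Math::BigRat->new('10/3');
my $exact_match = ($ten_thirds == $third * 10) ? 'Success!' : 'Failure!';
print "$exact_match\n";
```

Both approaches report Success! here. The tolerance approach is cheap but requires you to pick a sensible tolerance for your data; exact rationals avoid the question at the cost of slower arithmetic.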

B.3.3 Scalar Variables

Scalar values can be stored in scalar variables. A scalar variable is indicated with a $ before the variable's name. The name begins with a letter or underscore and can have any number of letters, underscores, or digits. A digit, however, can't be the first character in a variable name. Here are some examples of legal names of scalar variables:

$Var

$var_1

Here are some improper names for scalar variables:

$1var

$var!iable

Names are case sensitive: $dna is different from $DNA.

These rules for making proper variable names (apart from the beginning $) also hold for the names of array and hash variables and for subroutine names.
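
The naming rules can be expressed as a regular expression. The following is a simplified sketch, not Perl's own parser: the subroutine name is_legal_name is made up for this example, it checks only the part of the name after the $, @, or %, and it ignores Perl's special built-in variables and non-ASCII identifiers.

```perl
use strict;
use warnings;

# Returns 1 if $name starts with a letter or underscore and continues
# with only letters, digits, or underscores; returns 0 otherwise.
sub is_legal_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/ ? 1 : 0;
}

foreach my $name ('dna', 'DNA', '_count2', '2fast', 'gene-id') {
    my $verdict = is_legal_name($name) ? 'legal' : 'not legal';
    printf "%-8s %s\n", $name, $verdict;
}
```

The first three names pass the check; 2fast fails because it starts with a digit, and gene-id fails because a hyphen isn't allowed.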

A scalar variable may hold any type of scalar value mentioned previously, such as strings or the different types of numbers.

B.4 Assignment

Scalar variables are assigned scalar values with an assignment statement. For instance:

$thousand = 1000;

assigns the integer 1,000, a scalar value, to the scalar variable $thousand.

The assignment statement looks like an equal sign from elementary mathematics, but its meaning is different. The assignment statement is an instruction, not an assertion. It doesn't mean "$thousand equals 1,000." It means "store the scalar value 1,000 into the scalar variable $thousand". However, after the statement, the value of the scalar variable $thousand is, indeed, equal to 1000.

You can assign values to several scalar variables by surrounding variables and values in parentheses and separating them by commas, thus making lists:

($one, $two, $three) = ( 1, 2, 3);
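
Here is a slightly longer sketch of list assignment (the extra variable names are only for illustration). It shows that the whole right-hand list is evaluated before anything is assigned, and what happens when the two lists have different lengths.

```perl
use strict;
use warnings;

my ($one, $two, $three) = (1, 2, 3);

# The right-hand list is evaluated first, so two values can be
# swapped without a temporary variable.
($one, $two) = ($two, $one);

print "$one $two $three\n";    # prints: 2 1 3

# Extra values on the right are discarded;
# extra variables on the left are left undefined.
my ($first, $second) = ('A', 'C', 'G');
my ($x, $y)          = ('T');

my $y_status = defined($y) ? 'defined' : 'undefined';
print "$first $second $x $y_status\n";    # prints: A C T undefined
```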

There are several assignment operators besides = that are shorthand for longer expressions. For instance, $a += $b is equivalent to $a = $a + $b. Table B-1 is a complete list (it includes several operators that aren't covered in this book).

Table B-1. Assignment operator shorthands

Example of operator    Equivalent

$a += $b               $a = $a + $b (addition)
$a -= $b               $a = $a - $b (subtraction)
$a *= $b               $a = $a * $b (multiplication)
$a /= $b               $a = $a / $b (division)
$a **= $b              $a = $a ** $b (exponentiation)
$a %= $b               $a = $a % $b (remainder of $a / $b)
$a x= $b               $a = $a x $b (string $a repeated $b times)
$a &= $b               $a = $a & $b (bitwise AND)
$a |= $b               $a = $a | $b (bitwise OR)
$a ^= $b               $a = $a ^ $b (bitwise XOR)
$a >>= $b              $a = $a >> $b ($a shifted $b bits to the right)
$a <<= $b              $a = $a << $b ($a shifted $b bits to the left)
$a &&= $b              $a = $a && $b (logical AND)
$a ||= $b              $a = $a || $b (logical OR)
$a .= $b               $a = $a . $b (append string $b to $a)
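
The following sketch exercises several of these shorthand operators, with the running value shown in a comment after each step. The variable names are chosen for this example; $a and $b are avoided here because Perl's sort function gives them a special meaning.

```perl
use strict;
use warnings;

my $count = 10;
$count += 5;     # 15
$count -= 3;     # 12
$count *= 2;     # 24
$count /= 4;     # 6
$count **= 2;    # 36
$count %= 7;     # 1   (36 = 5 * 7 + 1)

my $bits = 1;
$bits <<= 3;     # 8   (binary 1000)
$bits |= 1;      # 9   (binary 1001)
$bits >>= 1;     # 4   (binary 100)

my $dna = 'AT';
$dna x= 3;       # ATATAT
$dna .= 'GC';    # ATATATGC

my $name = '';
$name ||= 'unknown';    # $name was false (empty), so it becomes 'unknown'

print "$count $bits $dna $name\n";    # prints: 1 4 ATATATGC unknown
```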
