Beginning Perl for Bioinformatics

Extracting Primary Sequence

Download 1.4 Mb.
Date conversion29.03.2017
Size1.4 Mb.
1   ...   20   21   22   23   24   25   26   27   28

11.4.1 Extracting Primary Sequence

Let's examine the subroutine extractSEQRES , now that the record types have been parsed out, and extract the primary amino acid sequence.

You need to extract each chain separately and return an array of one or more strings of sequence corresponding to those chains, instead of just one sequence.

The previous parse, in Example 11-4, left the required SEQRES record type, which stretches over several lines, in a scalar string that is the value of the key 'SEQRES' in a hash. Our success with the previous parsePDBrecordtypes subroutine that used iteration over lines (as opposed to regular expressions over multiline strings) leads to the same approach here. The split Perl function enables you to turn a multiline string into an array.

As you iterate through the lines of the SEQRES record type, notice when a new chain is starting, save the previous chain in @results, reset the $sequence array, and reset the $lastchain flag to the new chain. Also, when done with all the lines, make sure to save the last sequence chain in the @results array.

Also notice (and verify by exploring the Perl documentation for the function) that split, with the arguments you gave it, does what you want.

The third and final subroutine of Example 11-5 is called iub3to1 . Since in PDB the sequence information is in three-character codes, you need this subroutine to change those sequences into one-character codes. It uses a straightforward hash lookup to perform the translation.

We've now decomposed the problem into a few complementary subroutines. It's always interesting as to how to best divide a problem into cooperating subroutines. You can put the call to iub3to1 inside the extractSEQRES subroutine; that might be a cleaner way to package these subroutines together, since, outside the PDB file format, you won't have use for the strings of amino acids in three-character codes.

The important observation at this juncture is to point out that a few short subroutines, tied together with a very short main program, were sufficient to do a great deal of parsing of PDB files.

11.4.2 Finding Atomic Coordinates

So far, I've tried not to give more than a very brief overview of protein structure. However, in parsing PDB files, you will be faced with a great deal of detailed information about the structures and the experimental conditions under which they were determined. I will now present a short program that extracts the coordinates of atoms in a PDB file. I don't cover the whole story: for that, you will want to read the PDB documentation in detail and consult texts on protein structure, X-ray crystallography, and NMR techniques.

That said, let's extract the coordinates from the ATOM record type. ATOM record types are the most numerous of the several record types that deal with atomic-coordinate data: MODEL, ATOM, SIGATM, ANISOU, SIGUIJ, TER, HETATM, and ENDMDL. There are also several record types that handle coordinate transformation: ORIGXn, SCALEn, MTRIXn, and TVECT.

Here is part of the PDB documentation that shows the field definitions of each ATOM record:

The ATOM records present the atomic coordinates for standard residues.

They also present the occupancy and temperature factor for each atom.

Heterogen coordinates use the HETATM record type. The element symbol

is always present on each ATOM record; segment identifier and charge

are optional.
Record Format


1 - 6 Record name "ATOM "
7 - 11 Integer serial Atom serial number.
13 - 16 Atom name Atom name.
17 Character altLoc Alternate location indicator.

18 - 20 Residue name resName Residue name.

22 Character chainID Chain identifier.
23 - 26 Integer resSeq Residue sequence number.
27 AChar iCode Code for insertion of residues.
31 - 38 Real(8.3) x Orthogonal coordinates for X in


39 - 46 Real(8.3) y Orthogonal coordinates for Y in


47 - 54 Real(8.3) z Orthogonal coordinates for Z in


55 - 60 Real(6.2) occupancy Occupancy.
61 - 66 Real(6.2) tempFactor Temperature factor.
73 - 76 LString(4) segID Segment identifier, left-justified.
77 - 78 LString(2) element Element symbol, right-justified.
79 - 80 LString(2) charge Charge on the atom.

Here is a typical ATOM line:

ATOM 1 N GLY A 2 1.888 -8.251 -2.511 1.00 36.63 N

Let's do something fairly simple: let's extract all x, y, and z coordinates for each atom, plus the serial number (a unique integer for each atom in the molecule) and the element symbol. Example 11-6 is a subroutine that accomplishes that, with a main program to exercise the subroutine.

Example 11-6. Extract atomic coordinates from PDB file


# Extract atomic coordinates from PDB file
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Read in PDB file

my @file = get_file_data('pdb/c1/pdb1c1f.ent');

# Parse the record types of the PDB file

my %recordtypes = parsePDBrecordtypes(@file);

# Extract the atoms of all chains in the protein

my %atoms = parseATOM ( $recordtypes{'ATOM'} );

# Print out a couple of the atoms

print $atoms{'1'}, "\n";

print $atoms{'1078'}, "\n";

# Subroutines of Example 11-6

# parseATOM


# --extract x, y, and z coordinates, serial number and element symbol

# from PDB ATOM record type

# Return a hash with key=serial number, value=coordinates in a string

sub parseATOM {
my($atomrecord) = @_;
use strict;

use warnings;

my %results = ( );
# Turn the scalar into an array of ATOM lines

my(@atomrecord) = split(/\n/, $atomrecord);

foreach my $record (@atomrecord) {

my $number = substr($record, 6, 5); # columns 7-11

my $x = substr($record, 30, 8); # columns 31-38

my $y = substr($record, 38, 8); # columns 39-46

my $z = substr($record, 46, 8); # columns 47-54

my $element = substr($record, 76, 2); # columns 77-78

# $number and $element may have leading spaces: strip them

$number =~ s/^\s*//;

$element =~ s/^\s*//;
# Store information in hash

$results{$number} = "$x $y $z $element";


# Return the hash

return %results;


The parseATOM subroutine is quite short: the strict format of these ATOM records makes parsing the information quite straightforward. You first split the scalar argument, which contains the ATOM lines, into an array of lines.

Then, for each line, use the substr function to extract the specific columns of the line that contains the needed data: the serial number of the atom; the x, y, and z coordinates; and the element symbol.

Finally, save the results by making a hash with keys equal to the serial numbers and values set to strings containing the other four relevant fields. Now, this may not always be the most convenient way to return the data. For one thing, hashes are not sorted on the keys, so that would need to be an additional step if you had to sort the atoms by serial number. In particular, an array is a logical choice to store information sorted by serial number. Or, it could be that what you really want is to find all the metals, in which case, another data structure would be suggested. Nevertheless, this short subroutine shows one way to find and report information.

It often happens that what you really need is a reformatting of the data for use by another program. Using the technique of this subroutine, you can see how to extract the needed data and add a print statement that formats the data into the desired form. Take a look at the printf and sprintf functions to get more specific control over the format. For real heavy-duty formatting, there's the format function, which merits its own chapter in O'Reilly's comprehensive Programming Perl. (See also Chapter 12 and Appendix B of this book.)

Here's the output from Example 11-6:

1.888 -8.251 -2.511 N

18.955 -10.180 10.777 C

You can now extract at least the major portion of the atomic coordinates from a PDB file. Again, notice the good news: it doesn't take a long or particularly complex program to do what is needed.

This program has been designed so that its parts can be used in the future to work well for other purposes. You parse all record types, for instance, not only the ATOM record types. Let's take a look at a very short program that just parses the ATOM record type lines from an input file; by targeting only this one problem, you can write a much shorter program. Here's the program:

while(<>) {

/^ATOM/ or next;

my($n, $x, $y, $z, $element)

= ($_ =~ /^.{6}(.{5}).{19}(.{8})(.{8})(.{8}).{22}(..)/);

# $n and $element may have leading spaces: strip them

$n =~ s/^\s*//;

$element =~ s/^\s*//;
if (($n == 1) or ($n == 1078)) {

printf "%8.3f%8.3f%8.3f %2s\n", $x, $y, $z, $element;



For each line, a regular-expression match extracts just the needed information. Recall that a regular expression that contains parentheses metacharacters returns an array whose elements are the parts of the string that matched within the parentheses. You assign the five variables $number, $x, $y, $z, and $element to the substrings from these matched parentheses.

The actual regular expression is simply using dots and the quantifier operator .{num} to stand for num characters. In this way, you can, starting from the beginning of the string as represented by the caret ^ metacharacter, specify the columns with the information you want returned by surrounding them with parentheses.

For instance, you don't want the first six characters, so you specify them as ^.{6}, but you do want the next five characters because they contain the serial number of the atom; so, specify that field as (.{5}).

Frankly, I think that the use of substr is clearer for this purpose, but I wanted to show you an alternative way using regular expressions as well.

We've already seen the use of the printf function to format output with more options then with the print function.

This program has another important shortcut. It doesn't specify the file to open and read from. In Perl, you can give the input filename on the command line (or drag and drop it onto a Mac droplet), and the program takes its input from that file. Just use the angle brackets as shown in the first line of the program to read from the file. For short, fast programs, such as the one demonstrated here, this is a great convenience. You can leave out all the calls and tests for success of the open function and just use the angle brackets. You would call it from the command line like so, assuming you saved the program in a file called get_two_atoms:

%perl get_two_atoms pdb1a4o.ent

Alternatively, you can pipe data to the program with the commands:

% cat | perl get_two_atoms


% perl get_two__atoms < pdb1a40.ent

and use instead of <> in your program to read the data.

11.5 Controlling Other Programs

Perl makes it easy to start other programs and collect their output, all from within your Perl program. This is an extremely useful capability; for most programs, Perl makes it fairly simple to accomplish.

You may need to run some particular program many times, for instance over every file in PDB to extract secondary structure information. The program itself may not have a way to tell it "run yourself over all these files." Also, the output of the program may have all sorts of extraneous information. What you need is a much simpler report that just presents the information that interests you—perhaps in a format that could then be input to another program! With Perl you can write a program to do exactly this.

An important kind of program to automate is a web site that provides some useful program or data online. Using the appropriate Perl modules, you can connect to the web site, send it your input, collect the output, and then parse and reformat as you wish. It's actually not hard to do! O'Reilly's Perl Cookbook, a companion volume to Programming Perl, is an excellent source of short programs and helpful descriptions to get you started.

Perl is a great way to automate other programs. The next section shows an example of a Perl program that starts another program and collects, parses, reformats, and outputs the results. This program will control another program on the same computer. The example will be from a Unix or Linux environment; consult your Perl documentation on how to get the same functionality from your Windows or Macintosh platform.

11.5.1 The Stride Secondary Structure Predictor

We will use an external program to calculate the secondary structure from the 3D coordinates of a PDB file. As a secondary structure assignment engine, I use a program that outputs a secondary structure report, called stride. stride is available from EMBL ( and runs on Unix, Linux, Windows, Macintosh, and VMS systems. The program works very simply; just give it a command-line argument of a PDB filename and collect the output in the subroutine call_stride that follows.

Example 11-7 is the entire program: two subroutines and a main program, followed by a discussion.

Example 11-7. Call another program for secondary structure prediction


# Call another program to perform secondary structure prediction
use strict;

use warnings;

# Call "stride" on a file, collect the report

my(@stride_output) = call_stride('pdb/c1/pdb1c1f.ent');

# Parse the stride report into primary sequence, and secondary

# structure prediction

my($sequence, $structure) = parse_stride(@stride_output);
# Print out the beginnings of the sequence and the secondary structure

print substr($sequence, 0, 80), "\n";

print substr($structure, 0, 80), "\n";

# Subroutine for Example 11-7

# call_stride


# --given a PDB filename, return the output from the "stride"

# secondary structure prediction program
sub call_stride {
use strict;

use warnings;

my($filename) = @_;
# The stride program options

my($stride) = '/usr/local/bin/stride';

my($options) = '';

my(@results) = ( );

# Check for presence of PDB file

unless ( -e $filename ) {

print "File \"$filename\" doesn\'t seem to exist!\n";



# Start up the program, capture and return the output

@results = `$stride $options $filename`;

return @results;

# parse_stride


#--given stride output, extract the primary sequence and the

# secondary structure prediction, returning them in a

# two-element array.

sub parse_stride {
use strict;

use warnings;

my(@stridereport) = @_;

my($seq) = '';

my($str) = '';

my $length;

# Extract the lines of interest

my(@seq) = grep(/^SEQ /, @stridereport);

my(@str) = grep(/^STR /, @stridereport);
# Process those lines to discard all but the sequence

# or structure information

for (@seq) { $_ = substr($_, 10, 50) }

for (@str) { $_ = substr($_, 10, 50) }

# Return the information as an array of two strings

$seq = join('', @seq);

$str = join('', @str);
# Delete unwanted spaces from the ends of the strings.

# ($seq has no spaces that are wanted, but $str may)

$seq =~ s/(\s+)$//;
$length = length($1);
$str =~ s/\s{$length}$//;
return( ($seq, $str) );


As you can see in the subroutine call_stride, variables have been made for the program name ($stride) and for the options you may want to pass ($options). Since these are parts of the program you may want to change, put them as variables near the top of the code, to make them easy to find and alter. The argument to the subroutine is the PDB filename ($filename). (Of course, if you expect the options to change frequently, you can make them another argument to the subroutine.)

Since you're dealing with a program that takes a file, do a little error checking to see if a file by that name actually exists. Use the -e file test operator. Or you can omit this and let the stride program figure it out, and capture its error output. But that requires parsing the stride output for its error output, which involves figuring out how stride reports errors. This can get complicated, so I'd stick with using the -e file test operator.

The actual running of the program and collecting its output happens in just one line. The program to be run is enclosed in backticks, which run the program (first expanding variables) and return the output as an array of lines.

There are other ways to run a program. One common way is the system function call. It behaves differently from the backticks: it doesn't return the output of the command it calls (it just returns the exit status, an integer indicating success or failure of the command). Other methods include qx , the open system call, and the fork and exec functions.

11.5.2 Parsing Stride Output

I don't go into too much detail here about parsing the output of stride. Let's just exhibit some code that extracts the primary sequence and the secondary structure prediction. See the exercises at the end of the chapter for a challenge to extract the secondary structure information from a PDB file's HELIX, SHEET, and TURN record types and output the information in a similar format as stride does here.

Here is a typical section of a stride output (not the entire output):




REM . . . . . 1A4O




REM . . . . . 1A4O




REM . . . . 1A4O



Notice that each line is prefaced by an identifier, which should make collecting the different record types easy. Without even consulting the documentation (a slightly dangerous but sometimes expedient approach), you can see that the primary sequence has keyword SEQ, the structure prediction has keyword STR, and the data of interest lies from position 11 up to position 60 on each line. (We'll ignore everything else for now.)

The following list shows the one-letter secondary structure codes used by stride:


Alpha helix


3-10 helix


PI helix


Extended conformation

B or b

Isolated bridge




Coil (none of the above)

Using the substr function, the two for loops alter each line of the two arrays by saving the 11th to the 60th positions of those strings. This is where the desired information lies.

Now let's examine the subroutine parse_stride in Example 11-7 that takes stride output and returns an array of two strings, the primary sequence and the structure prediction.

This is a very "Perlish" subroutine that uses some features that manipulate text. What's interesting is the brevity of the program, which some of Perl's built-in functions make possible.

First, you receive the output of the stride program in the subroutine argument @_. Next, use the grep function to extract those lines of interest, which are easy to identify in this output, as they begin with clear identifiers SEQ and STR.

Next, you want to save just those positions (or columns) of these lines that have the sequence or structure information; you don't need the keywords, position numbers, or the PDB entry name at the end of the lines.

Finally, join the arrays into single strings. Here, there's one detail to handle; you need to remove any unneeded spaces from the ends of the strings. Notice that stride sometimes leaves spaces in the structure prediction, and in this example, has left some at the end of the structure prediction. So you shouldn't throw away all the spaces at the ends of the strings. Instead, throw away all the spaces at the end of the sequence string, because they are just superfluous spaces on the line. Now, see how many spaces that was, and throw the equal amount away at the end of the structure prediction string, thus preserving spaces that correspond to undetermined secondary structure.

Example 11-7 contains a main program that calls two subroutines, which, since they are short, are all included (so there's no need here for the BeginPerlBioinfo module). Here's the output of Example 11-7:



The first line shows the amino acids, and the second line shows the prediction of the secondary structure. Check the next section for a subroutine that will improve that output.

11.6 Exercises
Exercise 11.1

Use File::Find and the file test operators to find the oldest and largest files on the hard drive of your computer. (You can delete them or store them elsewhere if you're running short on disk space.)

Exercise 11.2

Find all the Perl programs on your computer.

Hint: Use File::Find. What do all Perl programs have in common?

Exercise 11.3

Parse the HEADER, TITLE, and KEYWORDS record types of all PDB files on your computer. Make a hash with key as a word from those record types and value as a list of filenames that contained that word. Save it as a DBM file and build a query program for it. In the end, you should be able to ask for, say, sugar, and get a list of all PDB files that contain that word in the HEADER, TITLE, or KEYWORDS records.

Exercise 11.4

Parse out the record types of a PDB file using regular expressions (as used in Chapter 10) instead of iterating through an array of input lines (as in this chapter.)

Exercise 11.5

Write a program that extracts the secondary structure information contained in the HELIX, SHEET, and TURN record types of PDB files. Print out the secondary structure and the primary sequence together, so that it's easy to see by what secondary structure a given residue is included. (Consider using a special alphabet for secondary structure, so that every residue in a helix is represented by H, for example.)

Exercise 11.6

Write a program that finds all PDB files under a given folder and runs a program (such as stride, or the program you wrote in Exercise 11.5) that reports on the secondary structure of each PDB file. Store the results in a DBM file keyed on the filename.

Exercise 11.7

Write a subroutine that, given two strings, prints them out one over the other, but with line breaks (similar to the stride program output). Use this subroutine to print out the strings from Example 11-7.

Exercise 11-8

Write a recursive subroutine to determine the size of an array. You may want to use the pop or unshift functions. (Ignore the fact that the scalar @ array returns the size of @array!)

Exercise 11.9

Write a recursive subroutine that extracts the primary amino acid sequence from the SEQRES record type of a PDB file.

Exercise 11.10

(Extra credit) Given an atom and a distance, find all other atoms in a PDB file that are within that distance of the atom.

Exercise 11.11

(Extra credit) Write a program to find some correlation between the primary amino acid sequence and the location of alpha helices.

Chapter 12. BLAST

In biological research, the search for sequence similarity is very important. For instance, a researcher who has discovered a potentially important DNA or protein sequence wants to know if it's already been identified and characterized by another researcher. If it hasn't, the researcher wants to know if it resembles any known sequence from any organism. This information can provide vital clues as to the role of the sequence in the organism.

The Basic Local Alignment Search Tool (BLAST) is one of the most popular software tools in biological research. It tests a query sequence against a library of known sequences in order to find similarity. BLAST is actually a collection of programs with versions for query-to-database pairs such as nucleotide-nucleotide, protein-nucleotide, protein-protein, nucleotide-protein, and more.

This chapter examines the output from the nucleotide-nucleotide version of the program, BLASTN . For simplicity's sake, I'll simply refer to it here as BLAST. The main goal of this chapter is to show how to write code to parse a BLAST output file using regular expressions. The code is simple and basic, but it does the job. Once you understand the basics, you can build more features into your parser or obtain one of the fancier BLAST output parsers that's available via the Web. In either case, you'll know enough about output parsers to use or extend them.

This chapter also gives you a brief introduction to Bioperl, which is a collection of Perl bioinformatics modules. The Bioperl project is an example of an open source project that you, the Perl bioinformatics programmer, can put to good use. The Perl programming language is itself an open source project. The program and its source code are available for use and modification with only very reasonable restrictions and at no cost.

12.1 Obtaining BLAST

There are a several implementations of BLAST. The most popular is probably the one offered free of charge by the National Center for Biotechnology Information (NCBI): The NCBI web site features a publicly available BLAST server, a comprehensive set of databases, and a well-organized collection of documents and tutorials, in addition to the BLAST software available for downloading.

Also popular is the WU-BLAST implementation from Washington University. The main web site, including a list of other WU-BLAST servers, can be found at Older versions of WU-BLAST are available at no charge. Newer versions are free if you qualify as a research or nonprofit organization and agree to the licensing arrangements from Washington University where the program is developed and maintained. If you work at a major research organization, you may already have a site license for the WU-BLAST program. If you are a for-profit company, there is a rather hefty charge for the newer WU-BLAST program (older versions are freely available if you want to run BLAST on your own computer). Pennsylvania State University also develops some BLAST programs, available at In addition to NCBI and WU-BLAST, many other BLAST server web sites are available. A Google search ( on "BLAST server" will bring up many hits.

A big question that faces researchers when they use BLAST is whether to use a public BLAST server or to run it locally. There are significant advantages to using a public server, the largest being that the databases (such as GenBank) used by the BLAST server are always up to date. To keep your own up-to-date copy of these databases requires a significant amount of hard-disk space, a computer with a fairly high-end processor and a lot of memory (to run the BLAST engine), a high-capacity network link, and a lot of time setting up and overseeing the software that updates the databases. On the other hand, perhaps you have your own library of sequences that you want to use in BLAST searches, you do frequent or large searches, or you have other reasons to run your own in-house BLAST engine. If that's the case, it makes sense to invest in the hardware and run it locally.

The online documentation for BLAST is fairly extensive and includes details on the statistical methods the program uses to calculate similarity. In the next section, I touch briefly on some of those points, but you should refer to the BLAST home page and to the excellent material at the NCBI web site for the whole story and detailed references. Our interest here is not the theory, but rather to parse the output of the program.

12.2 String Matching and Homology

String matching is the computer-science term for algorithms that find one string embedded in another. It has a fairly long and fruitful history, and many string-matching algorithms have been developed using a variety of techniques and for different cases. (See the Gusfield book in Appendix A for an excellent treatment with a biological emphasis.) We've already done a fair amount of string matching, using the binding operator to search for motifs and other text with regular expressions.

BLAST is basically a string-matching program. Details of the string-matching algorithms, and of the algorithms used in BLAST in particular, are beyond the scope of this book. But first I want to define some terms that are frequently confused or used interchangeably. I also briefly introduce the BLAST statistics.

Biological string matching looks for similarity as an indication of homology. Similarity between the query and the sequences in the database may be measured by the percent identity, or the number of bases in the query that exactly match a corresponding region of a sequence from the database. It may also be measured by the degree of conservation, which finds matches between equivalent (redundant) codons or between amino acid residues with similar properties that don't alter the function of a protein (see Chapter 8). Homology between sequences means the sequences are related evolutionarily. Two sequences are or are not homologous; there's no degree of homology.

At the risk of oversimplifying a complex topic, I'll summarize a few facts about BLAST statistics. (See the BLAST documentation for a complete picture.) The output of a BLAST search reports a set of scores and statistics on the matches it has found based on the raw score S, various parameters of the scoring algorithm, and properties of the query and database. The raw score S is a measure of similarity and the size of the match. The BLAST output lists the hits ranked by their E value. The E (expect) value of a match measures, roughly, the chances that the string matching (allowing for gaps) occurs in a randomly generated database of the same size and composition. The closer to 0 the E value is, the less likely it occurred by chance. In other words, the lower the E value, the better the match. As a general rule of thumb for BLASTN, an E value less than 1 may be a solid hit, and an E value of less than 10 may be worth looking at, but this is not a hard and fast rule. (Of course, proteins can be homologous with even a very small percent identity; the percent similarity is typically higher for homologous DNA.)

Now that you have the basics, let's write code to parse BLAST output. First, you separate the hits, then extract the sequence, and finally, you find the annotation showing the E value statistic.

12.3 BLAST Output Files

The following is part of a BLAST output file. I created it by entering a few lines of the sample.dna file from Chapter 8 into the BLAST program at the NCBI web site, without changing any of the default parameters. I then saved the output as text in the file blst.txt, which is available from this book's web site. I've used it repeatedly in the parsing routines throughout this chapter. Because the output is several pages long, I've truncated it here to show the beginning, the middle, and the end of the file.

BLASTN 2.1.3 [Apr-11-2001]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),

"Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs", Nucleic Acids Res. 25:3389-3402.

RID: 991533563-27495-9092


(400 letters)
Database: nt

868,831 sequences; 3,298,558,333 total letters

Score E

Sequences producing significant alignments: (bits) Value

dbj|AB031069.1|AB031069 Homo sapiens PCCX1 mRNA for protein cont... 793 0.0

ref|NM_014593.1| Homo sapiens CpG binding protein (CGBP), mRNA 779 0.0

gb|AF149758.1|AF149758 Homo sapiens CpG binding protein (CGBP) m... 779 0.0

ref|XM_008699.3| Homo sapiens CpG binding protein (CGBP), mRNA 765 0.0

emb|AL136862.1|HSM801830 Homo sapiens mRNA; cDNA DKFZp434F174 (f... 450 e-124

emb|AJ132339.1|HSA132339 Homo sapiens CpG island sequence, subcl... 446 e-123

emb|AJ236590.1|HSA236590 Homo sapiens chromosome 18 CpG island D... 406 e-111

dbj|AK010337.1|AK010337 Mus musculus ES cells cDNA, RIKEN full-l... 234 3e-59

dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIK... 210 5e-52

gb|AC009750.7|AC009750 Drosophila melanogaster, chromosome 2L, r... 46 0.017

gb|AE003580.2|AE003580 Drosophila melanogaster genomic scaffold ... 46 0.017

ref|NC_001905.1| Leishmania major chromosome 1, complete sequence 40 1.0

gb|AE001274.1|AE001274 Leishmania major chromosome 1, complete s... 40 1.0

gb|AC008299.5|AC008299 Drosophila melanogaster, chromosome 3R, r... 38 4.1

gb|AC018662.3|AC018662 Human Chromosome 7 clone RP11-339C9, comp... 38 4.1

gb|AE003774.2|AE003774 Drosophila melanogaster genomic scaffold ... 38 4.1

gb|AC008039.1|AC008039 Homo sapiens clone SCb-391H5 from 7q31, c... 38 4.1

gb|AC005315.2|AC005315 Arabidopsis thaliana chromosome II sectio... 38 4.1

emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B... 38 4.1


>dbj|AB031069.1|AB031069 Homo sapiens PCCX1 mRNA for protein containing CXXC

domain 1,

complete cds

Length = 2487
Score = 793 bits (400), Expect = 0.0

Identities = 400/400 (100%)

Strand = Plus / Plus
Query: 1 agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg 60


Sbjct: 1 agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg 60
Query: 61 agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg 120


Sbjct: 61 agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg 120
Query: 121 gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt 180


Sbjct: 121 gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt 180
Query: 181 cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat 240


Sbjct: 181 cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat 240
Query: 241 ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat 300


Sbjct: 241 ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat 300

Query: 301 gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac 360


Sbjct: 301 gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac 360
Query: 361 aactgcaatgagtggttccatggggactgcatccggatca 400


Sbjct: 361 aactgcaatgagtggttccatggggactgcatccggatca 400
>ref|NM_014593.1| Homo sapiens CpG binding protein (CGBP), mRNA

... (file truncated here)

>dbj|AK010337.1|AK010337 Mus musculus ES cells cDNA, RIKEN full-length enriched library, clone:2410002I16, full insert sequence Length = 2538 Score = 234 bits (118), Expect = 3e-59 Identities = 166/182 (91%) Strand = Plus / Plus Query: 219 gagcggagatatggagggagatggttcagacccagagcctccagatgccggggaggacag 278 ||||||||||||||| |||||||| ||||||| || ||||| ||||||||||| ||||| Sbjct: 260 gagcggagatatggaaggagatggctcagacctggaacctccggatgccggggacgacag 319 Query: 279 caagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaa 338 |||||| |||||||||||||| || ||||||||||||||||| ||||||||||||||||| Sbjct: 320 caagtctgagaatggggagaacgctcccatctactgcatctgtcgcaaaccggacatcaa 379 Query: 339 ctgcttcatgatcgggtgtgacaactgcaatgagtggttccatggggactgcatccggat 398 ||||||||||| || |||||||||||||| |||||||||||||| |||||||||||||| Sbjct: 380 ttgcttcatgattggatgtgacaactgcaacgagtggttccatggagactgcatccggat 439 Query: 399 ca 400 || Sbjct: 440 ca 441 Score = 44.1 bits (22), Expect = 0.066 Identities = 25/26 (96%) Strand = Plus / Plus Query: 118 gcggaagtagttgtgggcgcctttgc 143 ||||||||||||| |||||||||||| Sbjct: 147 gcggaagtagttgcgggcgcctttgc 172 >dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN full-length enriched library, clone:5830420C16, full insert sequence Length = 1461 Score = 210 bits (106), Expect = 5e-52 Identities = 151/166 (90%) Strand = Plus / Plus Query: 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294 |||||||| ||||||| || ||||| ||||||||||| ||||||||||| ||||||||| Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107 Query: 295 gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354 ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| || Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167 Query: 355 tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400 |||||||||||||| |||||||||||||| |||||||||||||||| Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213 Score = 44.1 bits (22), Expect = 0.066 Identities = 25/26 (96%) Strand = Plus / Plus Query: 118 gcggaagtagttgtgggcgcctttgc 143 ||||||||||||| |||||||||||| Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260 >gb|AC009750.7|AC009750 Drosophila melanogaster, chromosome 2L, region 23F-24A, BAC clone ...

(file truncated here)

>emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B17 on chromosome 9, complete sequence [Homo sapiens] Length = 179155 Score = 38.2 bits (19), Expect = 4.1 Identities = 22/23 (95%) Strand = Plus / Plus Query: 192 ggcgggggtcgtgagggagtgcg 214 |||| |||||||||||||||||| Sbjct: 48258 ggcgtgggtcgtgagggagtgcg 48280 Database: nt Posted date: May 30, 2001 3:54 AM Number of letters in database: -996,408,959 Number of sequences in database: 868,831 Lambda K H 1.37 0.711 1.31 Gapped Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 Number of Hits to DB: 436021 Number of Sequences: 868831 Number of extensions: 436021 Number of successful extensions: 7536 Number of sequences better than 10.0: 19 length of query: 400 length of database: 3,298,558,333 effective HSP length: 20 effective length of query: 380 effective length of database: 3,281,181,713 effective search space: 1246849050940 effective search space used: 1246849050940 T: 0 A: 30 X1: 6 (11.9 bits) X2: 15 (29.7 bits) S1: 12 (24.3 bits) S2: 19 (38.2 bits)

As you can see, the file consists of three parts: some header information at the beginning followed by a summary of the alignments, the alignments, and then some additional summary parameters and statistics at the end.
12.4 Parsing BLAST Output

So why parse BLAST output? One reason is to see if your DNA has any new matches against the DNA stored in the constantly growing databases. You can write a program to automatically perform a daily BLAST search and compare its results with those of the previous day by parsing the summary list of hits and comparing it with the previous day's summary list. You can then have the program email you if something new has turned up.

12.4.1 Extracting Annotation and Alignments

Example 12-1 consists of a main program and two new subroutines. The subroutines—parse_blast and parse_blast_alignment—use regular expressions to extract the various bits of data from a scalar string. I chose this method because the data, although structured, does not clearly identify each line with its function. (See the discussion in Chapter 10 and Chapter 11.)

Example 12-1. Extract annotation and alignments from BLAST output file


# Extract annotation and alignments from BLAST output file
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# declare and initialize variables

my $beginning_annotation = '';

my $ending_annotation = '';

my %alignments = ( );

my $filename = 'blast.txt';
parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);
# Print the annotation, and then

# print the DNA in new format just to check if we got it okay.

print $beginning_annotation;
foreach my $key (keys %alignments) {

print "$key\nXXXXXXXXXXXX\n", $alignments{$key}, "\nXXXXXXXXXXX\n";

print $ending_annotation;

# Subroutines for Example 12-1

# parse_blast


# --parse beginning and ending annotation, and alignments,

# from BLAST output file

sub parse_blast {
my($beginning_annotation, $ending_annotation, $alignments, $filename) = @_;
# $beginning_annotation--reference to scalar

# $ending_annotation --reference to scalar

# $alignments --reference to hash

# $filename --scalar

# declare and initialize variables

my $blast_output_file = '';

my $alignment_section = '';

# Get the BLAST program output into an array from a file

$blast_output_file = join( '', get_file_data($filename));
# Extract the beginning annotation, alignments, and ending annotation

($$beginning_annotation, $alignment_section, $$ending_annotation)

= ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^ Database:.*)/ms);

# Populate %alignments hash

# key = ID of hit

# value = alignment section

%$alignments = parse_blast_alignment($alignment_section);

# parse_blast_alignment


# --parse the alignments from a BLAST output file,

# return hash with

# key = ID

# value = text of alignment
sub parse_blast_alignment {
my($alignment_section) = @_;

# declare and initialize variables

my(%alignment_hash) = ( );
# loop through the scalar containing the BLAST alignments,

# extracting the ID and the alignment and storing in a hash


# The regular expression matches a line beginning with >,

# and containing the ID between the first pair of | characters;

# followed by any number of lines that don't begin with >

while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {

my($value) = $&;

my($key) = (split(/\|/, $value)) [1];

$alignment_hash{$key} = $value;


return %alignment_hash;


The main program does little more than call the parsing subroutine and print the results. The arguments, initialized as empty, are passed by reference (see Chapter 6).

The subroutine parse_blast does the top-level parsing job of separating the three sections of a BLAST output file: the annotation at the beginning, the alignments in the middle, and the annotation at the end. It then calls the parse_blast_alignment subroutine to extract the individual alignments from that middle alignment section. The data is first read in from the named file with our old friend the get_file_data subroutine from Chapter 8. Use the join function to store the array of file data into a scalar string.

The three sections of the BLAST output file are separated by the following statement:

($$beginning_annotation, $alignment_section, $$ending_annotation)
= ($blast_output_file =~ /(.*^ALIGNMENTS\n)(.*)(^ Database:.*)/ms);

The pattern match contains three parenthesized expressions:


which is returned into $$beginning_annotation;


which is saved in $alignment_section; and:

(^ Database:.*)

which is saved in $$ending_annotation.

The use of $$ instead of $ at the beginning of two of these variables indicates that they are references to scalar variables. Recall that they were passed in as arguments to the subroutine, where they were preceded by a backslash, like so:

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);

You've seen references to variables before, starting in Chapter 6. Let's review them briefly. Within the parse_blast subroutine, those variables with only one $ are references to the scalar variables. They need an additional $ to represent actual scalar variables. This is how references work; they need an additional special character to indicate what kinds of variables they are references to. So a reference to a scalar variable needs to start with $$, a reference to an array variable needs to start with @$, and a reference to a hash variable needs to start with %$.

The regular expression in the previous code snippet matches everything up to the word ALIGNMENTS at the end of a line (.*^ALIGNMENTS\n); then everything for a while (.*); then a line that begins with two spaces and the word Database: followed by the rest of the file (^ Database:.*). These three expressions in parentheses correspond to the three desired parts of the BLAST output file; the beginning annotation, the alignment section, and the ending annotation.

The alignments saved in $alignment_section are separated out by the subroutine parse_blast_alignment. This subroutine has one important loop:

while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {

my($value) = $&;

my($key) = (split(/\|/, $value)) [1];

$alignment_hash{$key} = $value;


You're probably thinking that this regular expression looks downright evil. At first glance, regular expressions do sometimes seem incomprehensible, so let's take a closer look. There are a few new things to examine.

The five lines comprise a while loop, which (due to the global /g modifier on the pattern match in the while loop) keeps matching the pattern as many times as it appears in the string. Each time the program cycles through the loop, the pattern match finds the value (the entire alignment), then determines the key. The key and values are saved in the hash %alignment_hash.

Here's an example of one of the matches that's found by this while loop when parsing the BLAST output shown in Section 12.3:

>emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B17 on

chromosome 9, complete

sequence [Homo sapiens]

Length = 179155

Score = 38.2 bits (19), Expect = 4.1

Identities = 22/23 (95%)

Strand = Plus / Plus

Query: 192 ggcgggggtcgtgagggagtgcg 214

|||| ||||||||||||||||||

Sbjct: 48258 ggcgtgggtcgtgagggagtgcg 48280

This text starts with a line beginning with a > character. In the complete BLAST output, sections like these follow one another. What you want to do is start matching from a line beginning with > and include all following adjacent lines that don't start with a > character. You also want to extract the identifier, which appears between the first and second vertical bar | characters on the first line (e.g., AL353748.13 in this alignment).

Let's dissect the regular expression:

$alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm

This pattern match, which appears in a while loop within the code, has the modifier m for multiline. The m modifier allows ^ to match any beginning-of-line inside the multiline string, and $ to match any end-of-line.

The regular expression breaks down as follows. The first part is:


It looks for > at the beginning of the BLAST output line, followed by .*, which matches any quantity of anything (except newlines), up to the first newline. In other words, it matches the first line of the alignment.

Here's the rest of the regular expression:


After the ^ which matches the beginning of the line, you'll see a negative lookahead assertion, (?!>), which ensures that a > doesn't follow. Next, the .* matches all non-newline characters, up to the final \n at the end of the line. All of that is wrapped in parentheses with a surrounding +, so that you match all the available lines.

Now that you've matched the entire alignment, you want to extract the key and populate the hash with your key and value. Within the while loop, the alignment that you just matched is automatically set by Perl as the value of the special variable $& and saved in the variable $value. Now you need to extract your key from the alignment. It can be found on the first line of the alignment stored in $value, between the first and second | symbols.

Extracting this identifying key is done using the split function, which breaks the string into an array. The call to split:

split(/\|/, $value)

splits $value into pieces delimited by | characters. That is, the | symbol is used to determine where one list element ends and the next one begins. (Remember that the vertical bar | is a metacharacter and must be escaped as \|.) By surrounding the call to split with parentheses and adding an array offset ([1]), you can isolate the key and save it into $key.

Let's step back now and look at Example 12-1 in its entirety. Notice that it's very short—barely more than two pages, including comments. Although it's not an easy program, due to the complexity of the regular expressions involved, you can make sense of it if you put a little effort into examining the BLAST output files and the regular expressions that parse it.

Regular expressions have lots of complex features, but as a result, they can do lots of useful things. As a Perl programmer, the effort you put into learning them is well worth it and can have significant payoffs down the road.

12.4.2 Parsing BLAST Alignments

Let's take the parsing of the BLAST output file a little further. Notice that some of the alignments include more than one aligned string—for instance, the alignment for ID AK017941.1, shown again here:

>dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN

full-length enriched

library, clone:5830420C16, full insert sequence

Length = 1461

Score = 210 bits (106), Expect = 5e-52

Identities = 151/166 (90%)

Strand = Plus / Plus
Query: 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294

|||||||| ||||||| || ||||| ||||||||||| ||||||||||| |||||||||

Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107
Query: 295 gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354

||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| ||

Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167
Query: 355 tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400

|||||||||||||| |||||||||||||| ||||||||||||||||

Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213
Score = 44.1 bits (22), Expect = 0.066

Identities = 25/26 (96%)

Strand = Plus / Plus

Query: 118 gcggaagtagttgtgggcgcctttgc 143

||||||||||||| ||||||||||||

Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260

To parse these alignments, we have to parse out each of the matched strings, which in BLAST terminology are called high-scoring pairs (HSPs).

Each HSP also contains some annotation, and then the HSP itself. Let's parse each HSP into annotation, query string, and subject string, together with the starting and ending positions of the strings. More parsing is possible; you can extract specific features of the annotation, as well as the locations of identical and nonidentical bases in the HSP, for instance.

Example 12-2 includes a pair of subroutines; one to parse the alignments into their HSPs, and the second to extract the sequences and their end positions. The main program extends Example 12-1 using these new subroutines.

Example 12-2. Parse alignments from BLAST output file


# Parse alignments from BLAST output file
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# declare and initialize variables

my $beginning_annotation = '';

my $ending_annotation = '';

my %alignments = ( );

my $alignment = '';

my $filename = 'blast.txt';

my @HSPs = ( );

my($expect, $query, $query_range, $subject, $subject_range) = ('','','','','');

parse_blast(\$beginning_annotation, \$ending_annotation, \%alignments, $filename);
$alignment = $alignments{'AK017941.1'};
@HSPs = parse_blast_alignment_HSP($alignment);
($expect, $query, $query_range, $subject, $subject_range)

= extract_HSP_information($HSPs[1]);

# Print the results

print "\n-> Expect value: $expect\n";

print "\n-> Query string: $query\n";

print "\n-> Query range: $query_range\n";

print "\n-> Subject String: $subject\n";

print "\n-> Subject range: $subject_range\n";


# Subroutines for Example 12-2


# parse_blast_alignment_HSP


# --parse beginning annotation, and HSPs,

# from BLAST alignment

# Return an array with first element set to the beginning annotation,

# and each successive element set to an HSP
sub parse_blast_alignment_HSP {
my($alignment ) = @_;
# declare and initialize variables

my $beginning_annotation = '';

my $HSP_section = '';

my @HSPs = ( );

# Extract the beginning annotation and HSPs

($beginning_annotation, $HSP_section )

= ($alignment =~ /(.*?)(^ Score =.*)/ms);
# Store the $beginning_annotation as the first entry in @HSPs

push(@HSPs, $beginning_annotation);

# Parse the HSPs, store each HSP as an element in @HSPs

while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {

push(@HSPs, $&);

# Return an array with first element = the beginning annotation,

# and each successive element = an HSP


# extract_HSP_information


# --parse a HSP from a BLAST output alignment section

# - return array with elements:

# Expect value

# Query string

# Query range

# Subject string

# Subject range

sub extract_HSP_information {
my($HSP) = @_;

# declare and initialize variables

my($expect) = '';

my($query) = '';

my($query_range) = '';

my($subject) = '';

my($subject_range) = '';
($expect) = ($HSP =~ /Expect = (\S+)/);
$query = join ( '' , ($HSP =~ /^Query.*\n/gm) );
$subject = join ( '' , ($HSP =~ /^Sbjct.*\n/gm) );
$query_range = join('..', ($query =~ /(\d+).*\D(\d+)/s));
$subject_range = join('..', ($subject =~ /(\d+).*\D(\d+)/s));
$query =~ s/[^acgt]//g;
$subject =~ s/[^acgt]//g;
return ($expect, $query, $query_range, $subject, $subject_range);


Example 12-2 gives the following output:

-> Expect value: 5e-52

-> Query string: ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg


-> Query range: 235..400
-> Subject String: ctggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatg


-> Subject range: 1048..1213

Let's discuss the new features of Example 12-2 and its subroutines. First notice that the two new subroutines from Example 12-1 have been placed into the module, so they aren't printed again here.

The main program, Example 12-2, starts the same as Example 12-1; it calls the parse_blast subroutine to separate the annotation from the alignments in the BLAST output file.

The next line fetches one of the alignments from the %alignments hash, which is then used as the argument to the parse_blast_alignment_HSP subroutine, which then returns an array of annotation (as the first element) and HSPs in @HSPs. Here you see that not only can a subroutine return an array on a scalar value; it can also return a hash.

Finally, Example 12-2 does the lower-level parsing of an individual HSP by calling the extract_HSP_information subroutine, and the extracted parts of one of the HSPs are printed.

Example 12-2 shows a certain inconsistency in our design. Some subroutines call their arguments by reference; others call them by value (see Chapter 6). You may ask: is this a bad thing?

The answer is: not necessarily. The subroutine parse_blast mixes several arguments, and one of them is not a scalar type. Recall that this is a potentially good place to use call-by-reference in Perl. The other subroutines don't mix argument types this way. However, they can be designed to call their arguments by reference.

Continuing with the code, let's examine the subroutine parse_blast_alignment_HSP . This takes one of the alignments from the BLAST output and separates out the individual HSP string matches. The technique used is, once again, regular expressions operating on a single string that contains all the lines of the alignment given as the input argument.

The first regular expression parses out the annotation and the section containing the HSPs:

($beginning_annotation, $HSP_section )

= ($alignment =~ /(.*?)(^ Score =.*)/ms);

The first parentheses in the regular expression is (.*?) This is the nongreedy or minimal matching mentioned in Chapter 9. Here it gobbles up everything before the first line that begins Score = (without the ? after the *, it would gobble everything until the final line that begins Score =). This is the exact dividing line between the beginning annotation and the HSP string matches.

The next loop and regular expression separates the individual HSP string matches:

while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {

push(@HSPs, $&);


This is the same kind of global string match in a while loop you've seen before; it keeps iterating as long as the match can be found. The other modifier /m is the multiline modifier, which enables the metacharacters $ and ^ to match before and after embedded newlines.

The expression within the first pair of parentheses—(^ Score =.*\n)—matches a line that begins Score =, which is the kind of line that begins an HSP string match section.

The code within the second pair of parentheses—(^(?! Score =).*\n)+—matches one or more (the + following the other parentheses) lines that do not begin with Score =. The ?! at the beginning of the embedded parentheses is the negative lookahead assertion you encountered in Example 12-1. So, in total, the regular expression captures a line beginning with Score = and all succeeding adjacent lines that don't begin with Score =.

12.5 Presenting Data

Up to now, we've relied on the print statement to format output. In this section, I introduce three additional Perl features for writing output:

  • printf function

  • here documents

  • format and write functions

The entire story about these Perl output features is beyond the scope of this book, but I'll tell you just enough to give you an idea of how they can be used.

12.5.1 The printf Function

The printf function is like the print function but with extra features that allow you to specify how certain data is printed out. Perl's printf function is taken from the C language function of the same name. Here's an example of a printf statement:

my $first = '3.14159265';

my $second = 76;

my $third = "Hello world!";

printf STDOUT "A float: %6.4f An integer: %-5d and a string: %s\n",

$first, $second, $third;

This code snippet prints the following:

A float: 3.1416 An integer: 76 and a string: Hello world!

The arguments to the printf function consist of a format string, followed by a list of values that are printed as specified by the format string. The format string may also contain any text along with the directives to print the list of values. (You may also specify an optional filehandle in the same manner you would a print function.)

The directives consist of a percent sign followed by a required conversion specifier, which in the example includes f for floating point, d for integer, and s for string. The conversion specifier indicates what kind of data is in the variable to be printed. Between the % and the conversion specifier, there may be 0 or more flags, an optional minimum field width, an optional precision, and an optional length modifier. The list of values following the format string must contain data that matches the types of directives, in order.

There are many possible options for these flags and specifiers (some are listed in Appendix B). Here's what is in Example 12-3. First, the directive %6.4f specifies to print a floating point (that is, a decimal) number, with a minimum width of six characters overall (padded with spaces if necessary), and at most four positions for the decimal part. You see in the output that, although the $f floating-point number gives the value of pi to eight decimal places, the example specifies a precision of four decimal places, which are all that is printed out.

The %-5d directive specifies an integer to be printed in a field of width 5; the - flag causes the number to be left-justified in the field. Finally, the %s directive prints a string.

12.5.2 here Documents

Now we'll briefly examine here documents. These are convenient ways to specify multiline text for output with perhaps some variables to be interpolated, in a way that looks pretty much the same in your code as it will in the output—that is, without a lot of print statements or embedded newline \n characters. We'll follow Example 12-3 and its output with a discussion.

Example 12-3. Example of here document


# Example of here document
use strict;

use warnings;


for( my $i = 0 ; $i < 2 ; ++$i ) {

print <

1   ...   20   21   22   23   24   25   26   27   28

The database is protected by copyright © 2017
send message

    Main page