# B.15.2.5 Matching any character with .

The period or dot . represents any character except a newline. (The pattern modifier /s makes it also match a newline.) So, . is like a character class that specifies every character.
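As a minimal sketch of the dot and the /s modifier described above (the two-line string and the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical two-line string: ACGT, a newline, then TTAA.
my $two_lines = "ACGT\nTTAA";

# Without /s, the dot cannot match the newline, so T.T finds nothing here.
my $without_s = ($two_lines =~ /T.T/)  ? 'match' : 'no match';

# With /s, the dot also matches the newline between the two lines.
my $with_s    = ($two_lines =~ /T.T/s) ? 'match' : 'no match';

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

This should report no match without /s and a match with /s, because the only T followed by any character and another T spans the newline.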

B.15.2.6 Beginning and end of strings with ^ and $

The ^ metacharacter doesn't match a character; rather, it asserts that the item that follows must be at the beginning of the string. Similarly, the $ metacharacter doesn't match a character but asserts that the item that precedes it must be at the end of the string (or before the final newline). For example: /^Watson and Crick/ matches if the string starts with Watson and Crick; and /Watson and Crick$/ matches if the string ends with Watson and Crick or Watson and Crick\n.
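The anchors can be tried out with a short sketch like the following. The input line is hypothetical, and \z (listed in Table B-3) is included only to contrast with $:

```perl
use strict;
use warnings;

# Hypothetical input line that still carries its trailing newline.
my $line = "Watson and Crick\n";

# ^ asserts that Watson is at the beginning of the string.
my $starts_with = ($line =~ /^Watson/) ? 1 : 0;

# $ succeeds at the end of the string or just before a final newline.
my $ends_with   = ($line =~ /Crick$/)  ? 1 : 0;

# \z succeeds only at the very end, so the trailing newline gets in the way.
my $at_very_end = ($line =~ /Crick\z/) ? 1 : 0;

print "$starts_with $ends_with $at_very_end\n";   # 1 1 0
```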

B.15.2.7 Quantifiers: * + {MIN,} {MIN,MAX} ?

These metacharacters indicate the repetition of an item. The * metacharacter indicates zero, one, or more of the preceding item. The + metacharacter indicates one or more of the preceding item. The brace { } metacharacters let you specify exactly the number of previous items, or a range. For instance, {3} means exactly three of the preceding item; {3,7} means three, four, five, six, or seven of the preceding item; and {3,} means three or more of the preceding item. The ? matches none or one of the preceding item.
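The following sketch applies each quantifier to a run of A's followed by a T. The candidate strings and the %matched hash are illustrative only:

```perl
use strict;
use warnings;

# Candidate strings with 0, 1, 3, and 8 leading A's before a T.
my @candidates = ('T', 'AT', 'AAAT', 'A' x 8 . 'T');

# Each entry pairs a label with an anchored pattern using one quantifier.
my @patterns = (
    [ 'A*T',     qr/^A*T$/     ],   # zero or more A's
    [ 'A+T',     qr/^A+T$/     ],   # one or more A's
    [ 'A{3}T',   qr/^A{3}T$/   ],   # exactly three A's
    [ 'A{3,7}T', qr/^A{3,7}T$/ ],   # three to seven A's
    [ 'A{3,}T',  qr/^A{3,}T$/  ],   # three or more A's
    [ 'A?T',     qr/^A?T$/     ],   # zero or one A
);

my %matched;
for my $pair (@patterns) {
    my ($label, $regex) = @$pair;
    # Keep only the candidates that the pattern matches.
    $matched{$label} = join ' ', grep { $_ =~ $regex } @candidates;
    printf "%-8s matches: %s\n", $label, $matched{$label};
}
```

The patterns are anchored with ^ and $ so that each quantifier has to account for the whole run of A's; without the anchors, a pattern such as A{3}T would also match inside a longer run.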

B.15.2.8 Making quantifiers match minimally with ?

The quantifiers just shown are greedy (or maximal) by default, meaning that they match as many items as possible. Sometimes, you want a minimal match that will match as few items as possible. You get that by following each of * + {} ? with a ?. So, for instance, *? tries to match as few as possible, perhaps even none, of the preceding item before it tries to match one or more of the preceding item. Here's a maximal match:

'hear ye hear ye hear ye' =~ /hear.*ye/;

print $&;

This matches 'hear' followed by .* (as many characters as possible), followed by 'ye', and prints:

hear ye hear ye hear ye

Here is a minimal match:

'hear ye hear ye hear ye' =~ /hear.*?ye/;

print $&;

This matches 'hear' followed by .*? (the fewest number of characters possible), followed by 'ye', and prints:

hear ye

B.15.3 Capturing Matched Patterns

You can place parentheses around parts of the pattern for which you want to know the matched string. For example:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /k(lmnop)q/;
print $1;

prints:

lmnop

You can place as many pairs of parentheses in a regular expression as you like; Perl automatically stores their matched substrings in special variables named $1, $2, and so on. The matches are numbered in order of the left-to-right appearance of their opening parenthesis.

Here's a more intricate example of capturing parts of a matched pattern in a string:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';
$alphabet =~ /(((a)b)c)/;
print "First pattern = ", $1, "\n";
print "Second pattern = ", $2, "\n";
print "Third pattern = ", $3, "\n";

This prints:

First pattern = abc

Second pattern = ab

Third pattern = a

B.15.4 Metasymbols

Metasymbols are sequences of two or more characters consisting of backslashes before normal characters. These metasymbols have special meanings in Perl regular expressions (and in double-quoted strings for most of them). There are quite a few of them, but that's because they're so useful. Table B-3 lists most of these metasymbols. The column "Atomic" indicates Yes if the metasymbol matches an item, No if the metasymbol just makes an assertion, and - if it takes some other action.

Table B-3. Alphanumeric metasymbols

| Symbol | Atomic | Meaning |
|---|---|---|
| \0 | Yes | Match the null character (ASCII NUL) |
| \NNN | Yes | Match the character given in octal, up to 377 |
| \n (n a digit, such as \1) | Yes | Match nth previously captured string (decimal) |
| \a | Yes | Match the alarm character (BEL) |
| \A | No | True at the beginning of a string |
| \b | Yes | Match the backspace character (BS); in character classes only |
| \b | No | True at word boundary |
| \B | No | True when not at word boundary |
| \cX | Yes | Match the control character Control-X |
| \d | Yes | Match any digit character |
| \D | Yes | Match any nondigit character |
| \e | Yes | Match the escape character (ASCII ESC, not backslash) |
| \E | - | End case (\L, \U) or metaquote (\Q) translation |
| \f | Yes | Match the formfeed character (FF) |
| \G | No | True at end-of-match position of prior m//g |
| \l | - | Lowercase the next character only |
| \L | - | Lowercase till \E |
| \n | Yes | Match the newline character (usually NL, but CR on Macs) |
| \Q | - | Quote (de-meta) metacharacters till \E |
| \r | Yes | Match the return character (usually CR, but NL on Macs) |
| \s | Yes | Match any whitespace character |
| \S | Yes | Match any nonwhitespace character |
| \t | Yes | Match the tab character (HT) |
| \u | - | Titlecase the next character only |
| \U | - | Uppercase (not titlecase) till \E |
| \w | Yes | Match any "word" character (alphanumerics plus _ ) |
| \W | Yes | Match any nonword character |
| \x{abcd} | Yes | Match the character given in hexadecimal |
| \z | No | True at end of string only |
| \Z | No | True at end of string or before optional newline |
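A few of these metasymbols are shown together in the sketch below. The record text and variable names are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical record used only for illustration.
my $record = "gene BRCA1 spans 81189 bases";

# \b(\d+)\b finds a run of digits that stands alone as a word,
# so the trailing 1 of BRCA1 is skipped.
my ($number) = $record =~ /\b(\d+)\b/;

# \w+ followed by \d finds a word that ends in a digit.
my ($gene) = $record =~ /\b(\w+\d)\b/;

# \s+ matches runs of whitespace, which makes it a handy field separator.
my @fields = split /\s+/, $record;

# \Q...\E quotes metacharacters, so the dot in $literal matches only a dot.
my $literal  = 'a.c';
my $unquoted = ('abc' =~ /^$literal$/)     ? 1 : 0;   # dot matches any character
my $quoted   = ('abc' =~ /^\Q$literal\E$/) ? 1 : 0;   # dot must be a literal dot

print "$number $gene ", scalar(@fields), " $unquoted $quoted\n";
```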

B.15.5 Extending Regular-Expression Sequences

Table B-4 includes several useful features that have been added to Perl's regular-expression capabilities.

Table B-4. Extended regular-expression sequences

| Extension | Atomic | Meaning |
|---|---|---|
| (?#...) | No | Comment, discard |
| (?:...) | Yes | Cluster-only parentheses, no capturing |
| (?imsx-imsx) | No | Enable/disable pattern modifiers |
| (?imsx-imsx:...) | Yes | Cluster-only parentheses plus modifiers |
| (?=...) | No | True if lookahead assertion succeeds |
| (?!...) | No | True if lookahead assertion fails |
| (?<=...) | No | True if lookbehind assertion succeeds |
| (?<!...) | No | True if lookbehind assertion fails |
| (?>...) | Yes | Match nonbacktracking subpattern |
| (?{...}) | No | Execute embedded Perl code |
| (??{...}) | Yes | Match regex from embedded Perl code |
| (?(...)...\|...) | Yes | Match with if-then-else pattern |
| (?(...)...) | Yes | Match with if-then pattern |
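The sketch below exercises four of the extensions in Table B-4 on a made-up DNA string; the sequence and variable names are assumptions chosen for illustration:

```perl
use strict;
use warnings;

# Hypothetical sequence used only for illustration.
my $dna = 'ATGCCCTAGATGAAA';

# (?:...) groups without capturing, so $1 holds the codon after the first ATG.
my ($after_start) = $dna =~ /(?:ATG)(\w{3})/;

# (?=T) is a lookahead: count each A that is followed by T,
# without the T becoming part of the match.
my $a_before_t = () = $dna =~ /A(?=T)/g;

# (?<!T) is a negative lookbehind: count each G not preceded by T.
my $g_not_after_t = () = $dna =~ /(?<!T)G/g;

# (?i) turns on case-insensitive matching from inside the pattern.
my $case_blind = ('atgccc' =~ /(?i)ATG/) ? 1 : 0;

print "$after_start $a_before_t $g_not_after_t $case_blind\n";   # CCC 2 1 1
```

The `() =` in the two counting lines forces the match into list context, and the outer scalar assignment then counts the matches.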

B.15.6 Pattern Modifiers

Pattern modifiers are single-letter commands placed after the closing forward slash of a regular expression or a substitution. They change the behavior of some regular-expression features. Table B-5 lists the most common pattern modifiers, followed by an example.

Table B-5. Pattern modifiers

| Modifier | Meaning |
|---|---|
| /i | Ignore upper- or lowercase distinctions |
| /s | Let . match newline |
| /m | Let ^ and $ match next to embedded \n |
| /x | Ignore (most) whitespace and permit comments in patterns |
| /o | Compile pattern once only |
| /g | Find all matches, not just the first one |

As an example, say you were looking for a name in text, but you didn't know if the name had an initial capital letter or was all capitalized. You can use the /i modifier, like so:

$text = "WATSON and CRICK won the Nobel Prize";
$text =~ /Watson/i;
print $&;

This matches (since /i causes upper- and lowercase distinctions to be ignored) and prints out the matched string WATSON.
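Some of the other modifiers in Table B-5 can be sketched in the same way. The three-line text below is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical three-line text.
my $text = "Watson\nCrick\nwatson";

# /g in list context returns every match; /i ignores case.
my @names = $text =~ /watson/gi;

# /m lets ^ and $ match next to the embedded newlines.
my $with_m    = ($text =~ /^Crick$/m) ? 1 : 0;
my $without_m = ($text =~ /^Crick$/)  ? 1 : 0;

# /x ignores most whitespace and permits comments inside the pattern.
my ($first_word) = $text =~ /
    ^ (\w+)     # the first word of the string
/x;

print "@names $with_m $without_m $first_word\n";
```

Here @names should hold both Watson and watson, and Crick is found as a whole line only when /m is present.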

B.16 Scalar and List Context

Every operation in Perl is evaluated in either scalar or list context. Many operators behave differently depending on the context they are in, returning a list in list context and a scalar in scalar context.

The simplest example of scalar and list contexts is the assignment statement. If the left side (the variable being assigned a value) is a scalar variable, the right side (the values being assigned) is evaluated in scalar context. In the following examples, the right side is an array @array of two elements. When the left side is a scalar variable, it causes @array to be evaluated in scalar context. In scalar context, an array returns the number of elements in the array:

@array = ('one', 'two');

$a = @array;

print $a;

This prints:

2

If you put parentheses around the $a, you make it a list with one element, which causes @array to be evaluated in list context:

@array = ('one', 'two');

($a) = @array;

print $a;

This prints:

one

Notice that when assigning to a list, if there are not enough variables for all the values, the extra values are simply discarded. To capture all the values, you'd do this:

@array = ('one', 'two');

($a, $b) = @array;

print "$a $b";

This prints:

one two

Similarly, if you have more variables on the left than there are values on the right, the extra variables are assigned the undefined value undef.
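A minimal sketch of that last case, using illustrative variable names ($first, $second, $third) rather than $a and $b:

```perl
use strict;
use warnings;

my @array = ('one', 'two');

# Three variables, but only two values: $third receives undef.
my ($first, $second, $third) = @array;

# Test with defined rather than printing $third directly,
# which would trigger an "uninitialized value" warning.
my $third_state = defined $third ? 'defined' : 'undef';

print "$first $second $third_state\n";   # one two undef
```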

When reading about Perl functions and operations, notice what the documentation has to say about scalar and list context. Very often, if your program is behaving strangely, it's because it is evaluating in a different context than you had thought.

Here are some general guidelines on when to expect scalar or list context:

• You get list context from function calls (anything in the argument position is evaluated in list context) and from list assignments.

• You get scalar context from string and number operators (arguments to such operators as . and + are assumed to be scalars); from boolean tests such as the conditional of an if () statement or the arguments to the || logical operator; and from scalar assignment.
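These guidelines can be checked with a small sketch. The array and the count_args subroutine are hypothetical helpers written for this illustration:

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

# Hypothetical helper: reports how many arguments it received.
sub count_args { return scalar @_ }

my $count    = @bases;                  # scalar assignment: number of elements
my ($first)  = @bases;                  # list assignment: first element
my $plus_one = @bases + 1;              # numeric operator: scalar context
my $label    = 'n=' . @bases;           # string operator: scalar context
my $nonempty = @bases ? 'yes' : 'no';   # boolean test: scalar context
my $n_args   = count_args(@bases, 'N'); # function arguments: list context

print "$count $first $plus_one $label $nonempty $n_args\n";   # 4 A 5 n=4 yes 5
```

In the last line, @bases is flattened into the argument list, so the subroutine sees five separate scalars.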

B.17 Subroutines and Modules

Subroutines are defined by the keyword sub, followed by the name of the subroutine, followed by a block enclosed by curly braces { } containing the body of the subroutine. Here's a simple example:

sub a_subroutine {

print "I'm in a subroutine\n";

}

In general, you can call subroutines using the name of the subroutine followed by a parenthesized list of arguments:

a_subroutine();

Arguments can be passed into subroutines as a list of scalars. If any arrays are given as arguments, their elements are interpolated into the list of scalars. The subroutine receives all scalar values as a list in the special variable @_. This example illustrates a subroutine definition and the calling of the subroutine with some arguments:

sub concatenate_dna {
    my($dna1, $dna2) = @_;
    my($concatenation);
    $concatenation = "$dna1$dna2";
    return $concatenation;
}

print concatenate_dna('AAA', 'CGC');

This prints:

AAACGC

The arguments 'AAA' and 'CGC' are passed into the subroutine as a list of scalars. The first statement in the subroutine's block:

my($dna1, $dna2) = @_;

assigns this list, available in the special variable @_, to the variables $dna1 and $dna2.

The variables $dna1 and $dna2 are declared as my variables to keep them local to the subroutine's block. In general, you declare all variables as my variables; this can be enforced by adding the statement use strict; near the beginning of your program. However, it is possible to use global variables that are not declared with my, which can be used anywhere in a program, including within subroutines. In this book, I've not used global variables.

The statement:

my($concatenation);

declares another variable for use by the subroutine.

After the statement:

$concatenation = "$dna1$dna2";

performs the work of the subroutine, the subroutine specifies its return value with the return statement:

return $concatenation;

The value returned from a call to a subroutine can be used however you wish; in this example, it is given as the argument to the print function.
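To illustrate the earlier point about my variables and use strict, here is a minimal sketch. The complement_copy subroutine is a hypothetical example, not one from this book:

```perl
use strict;      # undeclared variables now cause a compile-time error
use warnings;

my $dna = 'AAA';                # declared at file level

# Hypothetical subroutine: returns the complement of its argument.
sub complement_copy {
    my ($dna) = @_;             # a separate $dna, visible only in this block
    $dna =~ tr/ACGT/TGCA/;      # changes the inner copy only
    return $dna;
}

my $result = complement_copy($dna);

print "$dna $result\n";         # AAA TTT
```

The outer $dna is left unchanged because the subroutine works on its own my copy of the argument.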

If any arrays are given as arguments, their elements are interpolated into the @_ list, as in the following example:

sub example_sub {
    my(@arguments) = @_;
    print "@arguments\n";
}

my @array = ('two', 'three', 'four');

example_sub('one', @array, 'five');

which prints:

one two three four five

Note that the following attempt to mix arrays and scalars in the arguments to a subroutine won't work:

# This won't work!!
sub bad_sub {
    my(@array, $scalar) = @_;
    print $scalar;
}

my @arr = ('DNA', 'RNA');
my $string = 'Protein';

bad_sub(@arr, $string);

In this example, the subroutine's variable @array on the left side in the assignment statement consumes the entire list on the right side in @_, namely ('DNA', 'RNA', 'Protein'). The subroutine's variable $scalar won't be set, so the subroutine won't print 'Protein' as intended. To pass separate arrays and hashes to a subroutine, you need to use references; see Section 6.4.1 in Chapter 6. Here's a brief example:

sub good_sub {
    my($arrayref, $hashref) = @_;
    print "@$arrayref", "\n";
    my @keys = keys %$hashref;
    print "@keys", "\n";
}

my @arr = ('DNA', 'RNA');
my %nums = ( 'one' => 1, 'two' => 2);

good_sub(\@arr, \%nums);

which prints (the order of the hash keys is not guaranteed):

DNA RNA

one two

B.18 Built-in Functions

Perl has a great many built-in functions. Table B-6 is a partial list with short descriptions.

Table B-6. Perl built-in functions

| Function | Summary |
|---|---|
| abs VALUE | Return the absolute value of its numeric argument |
| atan2 Y, X | Return the principal value of the arc tangent of Y/X from -π to π |
| chdir EXPR | Change the working directory to EXPR (or home directory by default) |
| chmod MODE LIST | Change the file permissions of the LIST of files to MODE |
| chomp (VARIABLE or LIST) | Remove ending newline from string(s), if present |
| chop (VARIABLE or LIST) | Remove ending character from string(s) |
| chown UID, GID, LIST | Change owner and group of LIST of files to numeric UID and GID |
| close FILEHANDLE | Close the file, socket, or pipe associated with FILEHANDLE |
| closedir DIRHANDLE | Close the directory associated with DIRHANDLE |
| cos EXPR | Return the cosine of the radian number EXPR |
| dbmclose HASH | Break the binding between a DBM file and a hash |
| dbmopen HASH, DBNAME, MODE | Bind a DBM file to a HASH with permissions given in MODE |
| defined EXPR | Return true or false if EXPR has a defined value or not |
| delete EXPR | Delete an element (or slice) from a hash or an array |
| die LIST | Exit the program with an error message that includes LIST |
| each HASH | Step through a hash with one key, or key-value pair, at a time |
| exec PATHNAME LIST | Terminate the program and execute the program PATHNAME with arguments LIST |
| exists EXPR | Return true if hash key or array index exists |
| exit EXPR | Exit the program with the return value of EXPR |
| exp EXPR | Return the value of e raised to the exponent EXPR |
| format | Declare a format for use by the write function |
| grep EXPR, LIST | Return list of elements of LIST for which EXPR is true |
| gmtime | Get Greenwich mean time; Sunday is day 0, January is month 0, year is number of years since 1900. Example: ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdaylightsavingstime) = gmtime; |
| goto LABEL | Program control goes to statement marked with LABEL |
| hex EXPR | Return decimal value of hexadecimal EXPR |
| index STR, SUBSTR | Give the position of the first occurrence of SUBSTR in STR |
| int EXPR | Give the integer portion of the number in EXPR |
| join EXPR, LIST | Join the strings in LIST into a single string, separated by EXPR |
| keys HASH | Return a list of all the keys in HASH |
| last LABEL | Exit the immediately enclosing loop by default, or loop with LABEL |
| lc EXPR | Return a lowercased copy of string in EXPR |
| lcfirst EXPR | Return a copy of EXPR with first character lowercased |
| length EXPR | Return the length in characters of EXPR |
| localtime | Get local time in same format as in gmtime function |
| log EXPR | Return natural logarithm of number EXPR |
| m/PATTERN/ | The match operator for the regular-expression PATTERN, often abbreviated as /PATTERN/ |
| map BLOCK LIST (or map EXPR, LIST) | Evaluate BLOCK or EXPR for each element of LIST, return list of return values |
| mkdir FILENAME | Create the directory FILENAME |
| my EXPR | Localize the variables in EXPR to the enclosing block |
| next LABEL | Go to next iteration of enclosing loop by default or to loop marked with LABEL |
| oct EXPR | Return decimal value of octal value in EXPR |
| open FILEHANDLE, EXPR | Open a file by associating FILEHANDLE with the file and options given in EXPR |
| opendir DIRHANDLE, EXPR | Open the directory EXPR and assign handle DIRHANDLE |
| pop ARRAY | Remove and return the last element of ARRAY |
| pos SCALAR | Give location in string SCALAR where last m//g search left off |
| print FILEHANDLE LIST | Print LIST of strings to FILEHANDLE (default STDOUT) |
| printf FILEHANDLE FORMAT, LIST | Print string specified by FORMAT and variables LIST to FILEHANDLE |
| push ARRAY, LIST | Place the elements of LIST at the end of ARRAY |
| rand EXPR | Give pseudorandom decimal number from 0 to less than EXPR (default 1) |
| readdir DIRHANDLE | Return list of entries of directory DIRHANDLE |
| redo LABEL | Restart a loop block without reevaluating the conditional |
| ref EXPR | Return true or false if EXPR is a reference or not: if true, returned value indicates type of reference |
| rename OLDNAME, NEWNAME | Change the name of a file |
| return EXPR | Return from the current subroutine with value EXPR |
| reverse LIST | Give LIST in reverse order, or reverse strings in scalar context |
| rindex STR, SUBSTR | Like the index function but returns last occurrence of SUBSTR in STR |
| rmdir FILENAME | Delete the directory FILENAME |
| s/PATTERN/REPLACEMENT/ | Replace the match of regular-expression PATTERN with string REPLACEMENT |
| scalar EXPR | Force EXPR to be evaluated in scalar context |
| seek FILEHANDLE, OFFSET, WHENCE | Position the file pointer for FILEHANDLE: to OFFSET bytes from the start if WHENCE is 0, to the current position plus OFFSET if WHENCE is 1, or to the end of the file plus OFFSET if WHENCE is 2 |
| shift ARRAY | Remove and return the first element of ARRAY |
| sin EXPR | Return the sine of the radian number EXPR |
| sleep EXPR | Cause the program to sleep for EXPR seconds |
| sort USERSUB LIST (or sort BLOCK LIST) | Sort the LIST according to the order in USERSUB or BLOCK (default standard string order) |
| splice ARRAY, OFFSET, LENGTH, LIST | Remove LENGTH elements at OFFSET in ARRAY and replace with LIST, if present |
| split /PATTERN/, EXPR | Split the string EXPR at occurrences of /PATTERN/, return list |
| sprintf FORMAT, LIST | Return a string formatted as in the printf function |
| sqrt EXPR | Return the square root of the number EXPR |
| srand EXPR | Set random number seed for rand operator; only needed in versions of Perl before 5.004 |
| stat (FILEHANDLE or EXPR) | Return statistics on file EXPR or its FILEHANDLE. Example: ($dev,$inode,$mode,$num_of_links,$uid,$gid,$rdev,$size,$accesstime,$modifiedtime,$changetime,$blksize,$blocks) = stat $filename; |
| study SCALAR | Try to optimize subsequent pattern matches on string SCALAR |
| sub NAME BLOCK | Define a subroutine named NAME with program code in BLOCK |
| substr EXPR, OFFSET, LENGTH, REPLACEMENT | Return substring of string EXPR at position OFFSET and length LENGTH; the substring is replaced with REPLACEMENT if used |
| system PATHNAME LIST | Execute any program PATHNAME with arguments LIST; returns exit status of program, not its output; to capture output, use backticks. Example: @output = `/bin/who`; |
| tell FILEHANDLE | Return current file position in bytes in FILEHANDLE |
| tr/ORIGINAL/REPLACEMENT/ | Transliterate each character in ORIGINAL to the corresponding character in REPLACEMENT |
| truncate (FILEHANDLE or EXPR), LENGTH | Shorten file EXPR or opened with FILEHANDLE to LENGTH bytes |
| uc EXPR | Return uppercased version of string EXPR |
| ucfirst EXPR | Return string EXPR with first character capitalized |
| undef EXPR | Return the undefined value; if a defined variable or subroutine EXPR is given, it's no longer defined; undef can also be assigned to as a placeholder when you don't need to save a value |
| unlink LIST | Delete the LIST of files |
| unshift ARRAY, LIST | Add LIST elements to the beginning of ARRAY |
| use MODULE | Load the MODULE |
| values HASH | Return a list of all values of the HASH |
| wantarray | In a subroutine, return true if calling program expects a list return value |
| warn LIST | Print error message including LIST |
| write FILEHANDLE | Write formatted record to FILEHANDLE (default STDOUT) as defined by the format function |
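As a rough illustration of how several of these functions combine, here is a short sketch. The sequence and the variable names are made up for the example:

```perl
use strict;
use warnings;

# Hypothetical sequence used only for illustration.
my $dna = 'ATGCCA';

my $length  = length $dna;            # number of characters
my $codon   = substr($dna, 0, 3);     # first three characters
my $cc_at   = index($dna, 'CC');      # position of first CC, counting from 0
my $lower   = lc $dna;                # lowercased copy

# reverse in scalar context reverses the string; tr then complements it.
my $revcom = reverse $dna;
$revcom =~ tr/ACGT/TGCA/;

# split into single characters, sort them, and join them back with dashes.
my $sorted = join '-', sort { $a cmp $b } split(//, $dna);

# tr with an empty replacement counts characters without changing them.
my $gc_count   = ($dna =~ tr/GC//);
my $gc_content = sprintf '%.2f', $gc_count / $length;

# map evaluates its block for each element of the list.
my @lengths = map { length($_) } ('AT', 'GCG');

print "$length $codon $cc_at $lower $revcom $sorted $gc_content @lengths\n";
```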

Colophon

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.

The animals on the cover of Beginning Perl for Bioinformatics are green frog (Rana clamitans) and American bullfrog (Rana catesbeiana) tadpoles.

Tadpoles are the larvae of frogs and toads. They are aquatic and when first hatched have large, round heads and long, flat tails. Through a complex process of metamorphosis, tadpoles change from small fishlike creatures to the more familiar frogs and toads. This process can take from 10 days to 3 years depending on the species.

During the first stages of metamorphosis, a tadpole's hind legs sprout, its head begins to flatten, and its tail becomes shorter. In its early life, a tadpole feeds primarily on diatoms, algae, and small quantities of zooplankton. As metamorphosis continues, it stops eating and begins to reabsorb its tail for sustenance while its digestive system changes from primarily vegetarian to carnivorous. During the final stages of metamorphosis, the tadpole's front legs appear, its jaws form, its skeleton hardens, and its gills disappear as the lungs develop. It soon begins to breathe air at the surface of the water. A short time later, the tadpole emerges from the water, reabsorbs the last of its tail, and hops off as a frog or a toad.

Mary Anne Weeks Mayo was the production editor and copyeditor for Beginning Perl for Bioinformatics. Matt Hutchinson and Jane Ellin provided quality control. Edie Shapiro, Matt Hutchinson, and Derek DiMatteo provided production assistance. Ellen Troutman-Zaig wrote the index.

Ellie Volckhausen designed the cover of this book, based on a series design by Edie Freedman. The cover image is an original illustration created by Lorrie LeJeune. Emma Colby produced the cover layout with QuarkXPress 4.1 using Adobe's ITC Garamond font.

Melanie Wang designed the interior layout, based on a series design by David Futato. Neil Walls converted the files from SGML to FrameMaker 5.5.6 using tools created by Mike Sierra. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert Romano and Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6. The tip and warning icons were drawn by Christopher Bing. This colophon was written by Lorrie LeJeune.