Beginning Perl for Bioinformatics



B.15.2.5 Matching any character with .

The period or dot . represents any character except a newline. (The pattern modifier /s makes it also match a newline.) So, . is like a character class that specifies every character except the newline.
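As a minimal sketch of the difference the /s modifier makes, the following tries the same pattern with and without it. The two-line string in $record is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-line record: an identifier, a newline, then a sequence.
my $record = "seq1\nACGT";

# Without /s the dot cannot match the newline between the two lines.
my $plain = ($record =~ /seq1.ACGT/) ? 'match' : 'no match';

# With /s the dot matches the newline as well.
my $single = ($record =~ /seq1.ACGT/s) ? 'match' : 'no match';

print "without s modifier: $plain\n";   # without s modifier: no match
print "with s modifier: $single\n";     # with s modifier: match
```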



B.15.2.6 Beginning and end of strings with ^ and $

The ^ metacharacter doesn't match a character; rather, it asserts that the item that follows must be at the beginning of the string. Similarly, the $ metacharacter doesn't match a character but asserts that the item that precedes it must be at the end of the string (or before the final newline). For example: /^Watson and Crick/ matches if the string starts with Watson and Crick; and /Watson and Crick$/ matches if the string ends with Watson and Crick or Watson and Crick\n.
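The anchors can be tried out with a short sketch like this one. The string in $line is an assumed example of a line read from a file with its newline still attached.

```perl
use strict;
use warnings;

my $line = "Watson and Crick\n";   # assumed input, trailing newline intact

my $starts = ($line =~ /^Watson/) ? 1 : 0;   # anchored at the beginning
my $ends   = ($line =~ /Crick$/)  ? 1 : 0;   # $ tolerates one final newline
my $inner  = ($line =~ /^and/)    ? 1 : 0;   # "and" is not at the beginning

print "$starts $ends $inner\n";   # 1 1 0
```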



B.15.2.7 Quantifiers: * + {MIN,} {MIN,MAX} ?

These metacharacters indicate the repetition of an item. The * metacharacter indicates zero, one, or more of the preceding item. The + metacharacter indicates one or more of the preceding item. The brace { } metacharacters let you specify exactly the number of previous items, or a range. For instance, {3} means exactly three of the preceding item; {3,7} means three, four, five, six, or seven of the preceding item; and {3,} means three or more of the preceding item. The ? matches none or one of the preceding item.
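Here is a small sketch of each quantifier at work. The sequence in $dna is invented for the example, and each match is captured with parentheses so you can see exactly what the quantifier consumed.

```perl
use strict;
use warnings;

my $dna = 'CGAAAAATTGC';   # made-up sequence for illustration

my ($run)      = $dna =~ /(A{3,})/;    # three or more A's
my ($bounded)  = $dna =~ /(A{2,4})/;   # between two and four A's
my ($optional) = $dna =~ /(CGT?A)/;    # the T is optional
my ($plus)     = $dna =~ /(T+G*)/;     # one or more T's, then zero or more G's

print "$run $bounded $optional $plus\n";   # AAAAA AAAA CGA TTG
```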


B.15.2.8 Making quantifiers match minimally with ?

The quantifiers just shown are greedy (or maximal) by default, meaning that they match as many items as possible. Sometimes, you want a minimal match that will match as few items as possible. You get that by following each of * + {} ? with a ?. So, for instance, *? tries to match as few as possible, perhaps even none, of the preceding item before it tries to match one or more of the preceding item. Here's a maximal match:

'hear ye hear ye hear ye' =~ /hear.*ye/;

print $&;

This matches 'hear' followed by .* (as many characters as possible), followed by 'ye', and prints:

hear ye hear ye hear ye

Here is a minimal match:

'hear ye hear ye hear ye' =~ /hear.*?ye/;

print $&;

This matches 'hear' followed by .*? (the fewest number of characters possible), followed by 'ye', and prints:

hear ye


B.15.3 Capturing Matched Patterns

You can place parentheses around parts of the pattern for which you want to know the matched string. For example:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';

$alphabet =~ /k(lmnop)q/;

print $1;

prints:


lmnop

You can place as many pairs of parentheses in a regular expression as you like; Perl automatically stores their matched substrings in special variables named $1, $2, and so on. The matches are numbered in order of the left-to-right appearance of their opening parenthesis.

Here's a more intricate example of capturing parts of a matched pattern in a string:

$alphabet = 'abcdefghijklmnopqrstuvwxyz';

$alphabet =~ /(((a)b)c)/;

print "First pattern = ", $1,"\n";

print "Second pattern = ", $2,"\n";

print "Third pattern = ", $3,"\n";

This prints:

First pattern = abc

Second pattern = ab

Third pattern = a
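In practice it is safest to use $1, $2, and so on only after checking that the match succeeded, because a failed match leaves them holding values from an earlier match. The sketch below shows that check, and also the equivalent list-context form in which the match itself returns the captured strings. The header string is a made-up example.

```perl
use strict;
use warnings;

my $header = '>gi|12345 Homo sapiens';   # made-up FASTA-style header

# Use $1 and $2 only if the match succeeded.
my ($id, $description);
if ($header =~ /^>(\S+)\s+(.*)$/) {
    ($id, $description) = ($1, $2);
}

# In list context, the match returns the captured substrings directly.
my ($id2, $description2) = $header =~ /^>(\S+)\s+(.*)$/;

print "$id / $description\n";   # gi|12345 / Homo sapiens
```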



B.15.4 Metasymbols

Metasymbols are sequences of two or more characters consisting of backslashes before normal characters. These metasymbols have special meanings in Perl regular expressions (and in double-quoted strings for most of them). There are quite a few of them, but that's because they're so useful. Table B-3 lists most of these metasymbols. The column "Atomic" indicates Yes if the metasymbol matches an item, No if the metasymbol just makes an assertion, and - if it takes some other action.

Table B-3. Alphanumeric metasymbols



Symbol    Atomic  Meaning
\0        Yes     Match the null character (ASCII NUL)
\NNN      Yes     Match the character given in octal, up to 377
\n        Yes     Match the nth previously captured string (n is a decimal number, as in \1)
\a        Yes     Match the alarm character (BEL)
\A        No      True at the beginning of a string
\b        Yes     Match the backspace character (BS); only inside a character class
\b        No      True at word boundary
\B        No      True when not at word boundary
\cX       Yes     Match the control character Control-X
\d        Yes     Match any digit character
\D        Yes     Match any nondigit character
\e        Yes     Match the escape character (ASCII ESC, not backslash)
\E        -       End case (\L, \U) or metaquote (\Q) translation
\f        Yes     Match the formfeed character (FF)
\G        No      True at end-of-match position of prior m//g
\l        -       Lowercase the next character only
\L        -       Lowercase till \E
\n        Yes     Match the newline character (usually NL, but CR on Macs)
\Q        -       Quote (de-meta) metacharacters till \E
\r        Yes     Match the return character (usually CR, but NL on Macs)
\s        Yes     Match any whitespace character
\S        Yes     Match any nonwhitespace character
\t        Yes     Match the tab character (HT)
\u        -       Titlecase the next character only
\U        -       Uppercase (not titlecase) till \E
\w        Yes     Match any "word" character (alphanumerics plus _ )
\W        Yes     Match any nonword character
\x{abcd}  Yes     Match the character given in hexadecimal
\z        No      True at end of string only
\Z        No      True at end of string or before optional newline
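A few of these metasymbols can be seen together in the following sketch. The text in $text and the label 'A+T' are invented for illustration; the last two tests show why \Q and \E matter when a variable holding metacharacters is interpolated into a pattern.

```perl
use strict;
use warnings;

my $text = 'exon 12: A+T content 61';   # made-up line of text

my ($first_number) = $text =~ /(\d+)/;       # first run of digits
my @numbers        = $text =~ /\b(\d+)\b/g;  # every whole number
my @fields         = split /\s+/, $text;     # split on runs of whitespace

# Interpolated as-is, the + in $label acts as a quantifier, so the
# pattern means "one or more A's followed by T" and does not match.
my $label    = 'A+T';
my $unquoted = ($text =~ /$label/)     ? 1 : 0;

# \Q...\E quotes the metacharacters, so the + is matched literally.
my $quoted   = ($text =~ /\Q$label\E/) ? 1 : 0;

print "$first_number @numbers ", scalar(@fields), " $unquoted $quoted\n";
# 12 12 61 5 0 1
```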

B.15.5 Extending Regular-Expression Sequences

Table B-4 includes several useful features that have been added to Perl's regular-expression capabilities.
Table B-4. Extended regular-expression sequences


Extension         Atomic  Meaning
(?#...)           No      Comment, discard
(?:...)           Yes     Cluster-only parentheses, no capturing
(?imsx-imsx)      No      Enable/disable pattern modifiers
(?imsx-imsx:...)  Yes     Cluster-only parentheses plus modifiers
(?=...)           No      True if lookahead assertion succeeds
(?!...)           No      True if lookahead assertion fails
(?<=...)          No      True if lookbehind assertion succeeds
(?<!...)          No      True if lookbehind assertion fails
(?>...)           Yes     Match nonbacktracking subpattern
(?{...})          No      Execute embedded Perl code
(??{...})         Yes     Match regex from embedded Perl code
(?(...)...|...)   Yes     Match with if-then-else pattern
(?(...)...)       Yes     Match with if-then pattern
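The following sketch tries three of these extensions on short, invented sequences: a lookahead to count overlapping occurrences, a lookbehind that must fail for the match to succeed, and cluster-only parentheses that group without capturing. The "count = () = match" idiom used here counts matches by assigning them to an empty list.

```perl
use strict;
use warnings;

my $dna = 'GAAAAC';   # made-up sequence

# An ordinary global match consumes what it matches, so overlapping
# occurrences of AA are skipped.
my $non_overlapping = () = $dna =~ /AA/g;

# A lookahead matches without consuming, so every position that is
# followed by AA is counted.
my $overlapping = () = $dna =~ /(?=AA)/g;

# Count each G that is not immediately preceded by a C.
my $g_not_after_c = () = 'CGGACG' =~ /(?<!C)G/g;

# (?:...) groups without capturing, so the stop codon is still $1.
my ($stop) = 'ATGCCCTAA' =~ /(?:ATG)(?:[ACGT]{3})*?(TAA|TAG|TGA)/;

print "$non_overlapping $overlapping $g_not_after_c $stop\n";   # 2 3 1 TAA
```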

B.15.6 Pattern Modifiers

Pattern modifiers are single-letter commands placed after the forward slashes that delimit a regular expression or a substitution; they change the behavior of some regular-expression features. Table B-5 lists the most common pattern modifiers, followed by an example.

Table B-5. Pattern modifiers


Modifier  Meaning
/i        Ignore upper- or lowercase distinctions
/s        Let . match newline
/m        Let ^ and $ match next to embedded \n
/x        Ignore (most) whitespace and permit comments in patterns
/o        Compile pattern once only
/g        Find all matches, not just the first one

As an example, say you were looking for a name in text, but you didn't know if the name had an initial capital letter or was all capitalized. You can use the /i modifier, like so:

$text = "WATSON and CRICK won the Nobel Prize";

$text =~ /Watson/i;

print $&;

This matches (since /i causes upper- and lowercase distinctions to be ignored) and prints out the matched string WATSON.
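Modifiers can be combined. As a rough sketch using made-up data, the following counts matches with /g and /i, spreads a pattern over several commented lines with /x, and uses /m so that ^ matches at the start of every line of a multiline string.

```perl
use strict;
use warnings;

my $dna = 'acgtACGTacgt';   # made-up sequence

# /g finds every match; /i ignores case.
my $count = () = $dna =~ /acgt/gi;

# /x lets the pattern contain whitespace and comments.
my ($codon) = $dna =~ /
    ^              # start of the string
    ( [acgt]{3} )  # capture the first three bases
/xi;

# /m lets ^ match after each embedded newline.
my $fasta = ">seq1\nACGT\n>seq2\nGGCC\n";   # assumed two-record input
my @ids   = $fasta =~ /^>(\S+)/mg;

print "$count $codon @ids\n";   # 3 acg seq1 seq2
```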

B.16 Scalar and List Context

Every operation in Perl is evaluated in either scalar or list context. Many operators behave differently depending on the context they are in, returning a list in list context and a scalar in scalar context.

The simplest example of scalar and list contexts is the assignment statement. If the left side (the variable being assigned a value) is a scalar variable, the right side (the value being assigned) is evaluated in scalar context. In the following example, the right side is an array @array of two elements. When the left side is a scalar variable, it causes @array to be evaluated in scalar context. In scalar context, an array returns its number of elements:

@array = ('one', 'two');

$a = @array;

print $a;

This prints:

2

If you put parentheses around the $a, you make it a list with one element, which causes @array to be evaluated in list context:


@array = ('one', 'two');

($a) = @array;

print $a;

This prints:

one

Notice that when assigning to a list, if there are not enough variables for all the values, the extra values are simply discarded. To capture all the values, you'd do this:



@array = ('one', 'two');

($a, $b) = @array;

print "$a $b";

This prints:

one two

Similarly, if you have more variables on the left than values on the right, the extra variables are assigned the undefined value undef.
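A short sketch of that last case follows; the variable names are chosen only for the example. The defined function is used to test the third variable, since printing an undefined value would produce a warning.

```perl
use strict;
use warnings;

my @array = ('one', 'two');

# Three variables, two values: $third is left undefined.
my ($first, $second, $third) = @array;

my $third_state = defined($third) ? 'defined' : 'undef';

print "$first $second $third_state\n";   # one two undef
```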



When reading about Perl functions and operations, notice what the documentation has to say about scalar and list context. Very often, if your program is behaving strangely, it's because it is evaluating in a different context than you had thought.

Here are some general guidelines on when to expect scalar or list context:



  • You get list context from function calls (anything in the argument position is evaluated in list context) and from list assignments.

  • You get scalar context from string and number operators (arguments to such operators as . and + are assumed to be scalars); from boolean tests such as the conditional of an if () statement or the arguments to the || logical operator; and from scalar assignment.
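These guidelines can be checked with a small sketch such as the following, which uses an invented array @bases. The scalar function forces scalar context where list context would otherwise apply, and reverse shows how one function can give different results in the two contexts.

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

my $count   = @bases;        # scalar assignment: number of elements
my ($first) = @bases;        # list assignment: first element
my $sum     = @bases + 1;    # + supplies scalar context

# An if () test supplies scalar (boolean) context.
my $nonempty = 'no';
if (@bases) { $nonempty = 'yes'; }

# reverse in list context reverses the order of its arguments;
# in scalar context it joins them and reverses the characters.
my @reversed_list   = reverse('ACG', 'T');
my $reversed_string = reverse('ACG', 'T');

# print gives its arguments list context, so scalar() is needed
# here to print the count rather than the elements.
print scalar(@bases), " $count $first $sum $nonempty\n";   # 4 4 A 5 yes
print "@reversed_list $reversed_string\n";                 # T ACG TGCA
```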

B.17 Subroutines and Modules

Subroutines are defined by the keyword sub, followed by the name of the subroutine, followed by a block enclosed by curly braces { } containing the body of the subroutine. Here's a simple example:

sub a_subroutine {
    print "I'm in a subroutine\n";
}

In general, you can call subroutines using the name of the subroutine followed by a parenthesized list of arguments:


a_subroutine();

Arguments can be passed into subroutines as a list of scalars. If any arrays are given as arguments, their elements are interpolated into the list of scalars. The subroutine receives all scalar values as a list in the special variable @_. This example illustrates a subroutine definition and the calling of the subroutine with some arguments:

sub concatenate_dna {
    my($dna1, $dna2) = @_;
    my($concatenation);
    $concatenation = "$dna1$dna2";
    return $concatenation;
}

print concatenate_dna('AAA', 'CGC');

This prints:

AAACGC


The arguments 'AAA' and 'CGC' are passed into the subroutine as a list of scalars. The first statement in the subroutine's block:

my($dna1, $dna2) = @_;

assigns this list, available in the special variable @_, to the variables $dna1 and $dna2.

The variables $dna1 and $dna2 are declared as my variables to keep them local to the subroutine's block. In general, you declare all variables as my variables; this can be enforced by adding the statement use strict; near the beginning of your program. However, it is possible to use global variables that are not declared with my, which can be used anywhere in a program, including within subroutines. In this book, I've not used global variables.

The statement:

my($concatenation);

declares another variable for use by the subroutine.

After the statement:

$concatenation = "$dna1$dna2";

performs the work of the subroutine, the subroutine defines its value with the return statement:

return $concatenation;

The value returned from a call to a subroutine can be used however you wish; in this example, it is given as the argument to the print function.

If any arrays are given as arguments, their elements are interpolated into the @_ list, as in the following example:

sub example_sub {
    my(@arguments) = @_;
    print "@arguments\n";
}

my @array = ('two', 'three', 'four');

example_sub('one', @array, 'five');

which prints:

one two three four five

Note that the following attempt to mix arrays and scalars in the arguments to a subroutine won't work:

# This won't work!!

sub bad_sub {
    my(@array, $scalar) = @_;
    print $scalar;
}

my @arr = ('DNA', 'RNA');
my $string = 'Protein';

bad_sub(@arr, $string);

In this example, the subroutine's variable @array on the left side in the assignment statement consumes the entire list on the right side in @_, namely ('DNA', 'RNA', 'Protein'). The subroutine's variable $scalar won't be set, so the subroutine won't print 'Protein' as intended. To pass separate arrays and hashes to a subroutine, you need to use references; see Section 6.4.1 in Chapter 6. Here's a brief example:

sub good_sub {
    my($arrayref, $hashref) = @_;

    print "@$arrayref", "\n";
    my @keys = keys %$hashref;
    print "@keys", "\n";
}

my @arr = ('DNA', 'RNA');
my %nums = ( 'one' => 1, 'two' => 2);

good_sub(\@arr, \%nums);

which prints (the order of the hash keys is not guaranteed):

DNA RNA

one two
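Because a reference points at the caller's own data, a subroutine can also use it to fill in or change that data. The following is a minimal sketch with a hypothetical subroutine count_bases, which tallies the bases of several sequences into a hash supplied by the caller. The keys are sorted before printing so that the output order is predictable.

```perl
use strict;
use warnings;

# count_bases is a hypothetical helper written for this example.
sub count_bases {
    my($seqs_ref, $counts_ref) = @_;

    foreach my $seq (@$seqs_ref) {
        # Split the sequence into single characters and tally each one
        # directly in the caller's hash.
        $counts_ref->{$_}++ for split //, $seq;
    }
}

my @seqs = ('ACG', 'AAT');
my %counts;

count_bases(\@seqs, \%counts);

my $report = join ' ', map { "$_=$counts{$_}" } sort keys %counts;
print "$report\n";   # A=3 C=1 G=1 T=1
```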


B.18 Built-in Functions

Perl has a great many built-in functions. Table B-6 is a partial list with short descriptions.


Table B-6. Perl built-in functions


Function                                  Summary
abs VALUE                                 Return the absolute value of its numeric argument
atan2 Y, X                                Return the principal value of the arc tangent of Y/X, from -π to π
chdir EXPR                                Change the working directory to EXPR (or home directory by default)
chmod MODE LIST                           Change the file permissions of the LIST of files to MODE
chomp (VARIABLE or LIST)                  Remove ending newline from string(s), if present
chop (VARIABLE or LIST)                   Remove ending character from string(s)
chown UID, GID, LIST                      Change owner and group of LIST of files to numeric UID and GID
close FILEHANDLE                          Close the file, socket, or pipe associated with FILEHANDLE
closedir DIRHANDLE                        Close the directory associated with DIRHANDLE
cos EXPR                                  Return the cosine of the radian number EXPR
dbmclose HASH                             Break the binding between a DBM file and a hash
dbmopen HASH, DBNAME, MODE                Bind a DBM file to a HASH with permissions given in MODE
defined EXPR                              Return true if EXPR has a defined value, false otherwise
delete EXPR                               Delete an element (or slice) from a hash or an array
die LIST                                  Exit the program with an error message that includes LIST
each HASH                                 Step through a hash with one key, or key-value pair, at a time
exec PATHNAME LIST                        Terminate the program and execute the program PATHNAME with arguments LIST
exists EXPR                               Return true if hash key or array index exists
exit EXPR                                 Exit the program with the return value of EXPR
exp EXPR                                  Return the value of e raised to the exponent EXPR
format                                    Declare a format for use by the write function
grep EXPR, LIST                           Return list of elements of LIST for which EXPR is true
gmtime                                    Get Greenwich mean time; Sunday is day 0, January is month 0, year is number of years since 1900; example:
                                          ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,
                                          $isdaylightsavingstime) = gmtime;
goto LABEL                                Program control goes to statement marked with LABEL
hex EXPR                                  Return decimal value of hexadecimal EXPR
index STR, SUBSTR                         Give the position of the first occurrence of SUBSTR in STR
int EXPR                                  Give the integer portion of the number in EXPR
join EXPR, LIST                           Join the strings in LIST into a single string, separated by EXPR
keys HASH                                 Return a list of all the keys in HASH
last LABEL                                Exit the immediately enclosing loop by default, or loop with LABEL
lc EXPR                                   Return a lowercased copy of string in EXPR
lcfirst EXPR                              Return a copy of EXPR with first character lowercased
length EXPR                               Return the length in characters of EXPR
localtime                                 Get local time in same format as in gmtime function
log EXPR                                  Return natural logarithm of number EXPR
m/PATTERN/                                The match operator for the regular expression PATTERN, often abbreviated as /PATTERN/
map BLOCK LIST (or map EXPR, LIST)        Evaluate BLOCK or EXPR for each element of LIST, return list of return values
mkdir FILENAME                            Create the directory FILENAME
my EXPR                                   Localize the variables in EXPR to the enclosing block
next LABEL                                Go to next iteration of enclosing loop by default or to loop marked with LABEL
oct EXPR                                  Return decimal value of octal value in EXPR
open FILEHANDLE, EXPR                     Open a file by associating FILEHANDLE with the file and options given in EXPR
opendir DIRHANDLE, EXPR                   Open the directory EXPR and assign handle DIRHANDLE
pop ARRAY                                 Remove and return the last element of ARRAY
pos SCALAR                                Give location in string SCALAR where last m//g search left off
print FILEHANDLE LIST                     Print LIST of strings to FILEHANDLE (default STDOUT)
printf FILEHANDLE FORMAT, LIST            Print string specified by FORMAT and variables LIST to FILEHANDLE
push ARRAY, LIST                          Place the elements of LIST at the end of ARRAY
rand EXPR                                 Give pseudorandom decimal number from 0 to less than EXPR (default 1)
readdir DIRHANDLE                         Return list of entries of directory DIRHANDLE
redo LABEL                                Restart a loop block without reevaluating the conditional
ref EXPR                                  Return true if EXPR is a reference, false otherwise; if true, the returned value indicates the type of reference
rename OLDNAME, NEWNAME                   Change the name of a file
return EXPR                               Return from the current subroutine with value EXPR
reverse LIST                              Give LIST in reverse order, or reverse strings in scalar context
rindex STR, SUBSTR                        Like the index function but returns last occurrence of SUBSTR in STR
rmdir FILENAME                            Delete the directory FILENAME
s/PATTERN/REPLACEMENT/                    Replace the match of regular-expression PATTERN with string REPLACEMENT
scalar EXPR                               Force EXPR to be evaluated in scalar context
seek FILEHANDLE, OFFSET, WHENCE           Position the file pointer for FILEHANDLE at OFFSET bytes from the start of the file if WHENCE is 0, at the current position plus OFFSET if WHENCE is 1, or at OFFSET bytes from the end if WHENCE is 2
shift ARRAY                               Remove and return the first element of ARRAY
sin EXPR                                  Return the sine of the radian number EXPR
sleep EXPR                                Cause the program to sleep for EXPR seconds
sort USERSUB LIST (or sort BLOCK LIST)    Sort the LIST according to the order in USERSUB or BLOCK (default standard string order)
splice ARRAY, OFFSET, LENGTH, LIST        Remove LENGTH elements at OFFSET in ARRAY and replace with LIST, if present
split /PATTERN/, EXPR                     Split the string EXPR at occurrences of /PATTERN/, return list
sprintf FORMAT, LIST                      Return a string formatted as in the printf function
sqrt EXPR                                 Return the square root of the number EXPR
srand EXPR                                Set random number seed for rand operator; only needed in versions of Perl before 5.004
stat (FILEHANDLE or EXPR)                 Return statistics on file EXPR or its FILEHANDLE; example:
                                          ($dev,$inode,$mode,$num_of_links,$uid,$gid,$rdev,$size,$accesstime,
                                          $modifiedtime,$changetime,$blksize,$blocks) = stat $filename;
study SCALAR                              Try to optimize subsequent pattern matches on string SCALAR
sub NAME BLOCK                            Define a subroutine named NAME with program code in BLOCK
substr EXPR, OFFSET, LENGTH, REPLACEMENT  Return substring of string EXPR at position OFFSET and length LENGTH; the substring is replaced with REPLACEMENT if used
system PATHNAME LIST                      Execute any program PATHNAME with arguments LIST; returns exit status of program, not its output; to capture output, use backticks; example:
                                          @output = `/bin/who`;
tell FILEHANDLE                           Return current file position in bytes in FILEHANDLE
tr/ORIGINAL/REPLACEMENT/                  Transliterate each character in ORIGINAL to the corresponding character in REPLACEMENT
truncate (FILEHANDLE or EXPR), LENGTH     Shorten the file named EXPR, or the file opened with FILEHANDLE, to LENGTH bytes
uc EXPR                                   Return uppercased version of string EXPR
ucfirst EXPR                              Return string EXPR with first character capitalized
undef EXPR                                Return the undefined value; if a defined variable or subroutine EXPR is given, it's no longer defined; undef can also stand in as a placeholder in a list assignment when you don't need to save a value
unlink LIST                               Delete the LIST of files
unshift ARRAY, LIST                       Add LIST elements to the beginning of ARRAY
use MODULE                                Load the MODULE
values HASH                               Return a list of all values of the HASH
wantarray                                 In a subroutine, return true if calling program expects a list return value
warn LIST                                 Print error message including LIST
write FILEHANDLE                          Write formatted record to FILEHANDLE (default STDOUT) as defined by the format function
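To give a feel for how several of these functions combine, here is a brief sketch that works on a short, invented DNA string. It uses uc, length, index, substr, reverse, tr, and join from Table B-6; the variable names are chosen only for this example.

```perl
use strict;
use warnings;

my $dna = 'atgcccgggtaa';   # made-up sequence

my $upper  = uc $dna;                  # ATGCCCGGGTAA
my $length = length $upper;            # number of bases
my $where  = index($upper, 'GGG');     # position of first GGG, counting from 0
my $start  = substr($upper, 0, 3);     # first three bases

# Reverse complement: reverse the string (scalar context), then
# transliterate each base to its complement.
my $revcom = reverse $upper;
$revcom =~ tr/ACGT/TGCA/;

# Break the sequence into codons and join them with spaces.
my @codons = $upper =~ /(...)/g;
my $spaced = join ' ', @codons;

print "$length $where $start\n";   # 12 6 ATG
print "$revcom\n";                 # TTACCCGGGCAT
print "$spaced\n";                 # ATG CCC GGG TAA
```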

Colophon

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.

The animals on the cover of Beginning Perl for Bioinformatics are green frog (Rana clamitans) and American bullfrog (Rana catesbeiana) tadpoles.

Tadpoles are the larvae of frogs and toads. They are aquatic and when first hatched have large, round heads and long, flat tails. Through a complex process of metamorphosis, tadpoles change from small fishlike creatures to the more familiar frogs and toads. This process can take from 10 days to 3 years depending on the species.

During the first stages of metamorphosis, a tadpole's hind legs sprout, its head begins to flatten, and its tail becomes shorter. In its early life, a tadpole feeds primarily on diatoms, algae, and small quantities of zooplankton. As metamorphosis continues, it stops eating and begins to reabsorb its tail for sustenance while its digestive system changes from primarily vegetarian to carnivorous. During the final stages of metamorphosis, the tadpole's front legs appear, its jaws form, its skeleton hardens, and its gills disappear as the lungs develop. It soon begins to breathe air at the surface of the water. A short time later, the tadpole emerges from the water, reabsorbs the last of its tail, and hops off as a frog or a toad.

Mary Anne Weeks Mayo was the production editor and copyeditor for Beginning Perl for Bioinformatics. Matt Hutchinson and Jane Ellin provided quality control. Edie Shapiro, Matt Hutchinson, and Derek DiMatteo provided production assistance. Ellen Troutman-Zaig wrote the index.

Ellie Volckhausen designed the cover of this book, based on a series design by Edie Freedman. The cover image is an original illustration created by Lorrie LeJeune. Emma Colby produced the cover layout with QuarkXPress 4.1 using Adobe's ITC Garamond font.

Melanie Wang designed the interior layout, based on a series design by David Futato. Neil Walls converted the files from SGML to FrameMaker 5.5.6 using tools created by Mike Sierra. The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; and the code font is LucasFont's TheSans Mono Condensed. The illustrations that appear in the book were produced by Robert Romano and Jessamyn Read using Macromedia FreeHand 9 and Adobe Photoshop 6. The tip and warning icons were drawn by Christopher Bing. This colophon was written by Lorrie LeJeune.





