Beginning Perl for Bioinformatics



Download 1.4 Mb.
Page20/28
Date conversion29.03.2017
Size1.4 Mb.
1   ...   16   17   18   19   20   21   22   23   ...   28

10.3.2 Using Scalars

A second way to separate annotations from sequences in GenBank records is to read the entire record into a scalar variable and operate on it with regular expressions. For some kinds of data, this can be a more convenient way to parse the input (compared to scanning through an array, as in Example 10-1).

Usually string data is stored one line per scalar variable with its newlines, if any, at the end of the string. Sometimes, however, you store several lines concatenated together in one string that is, in turn, stored in a single scalar variable. These multiline strings aren't uncommon; you used them to gather the sequence from a FASTA file in Examples Example 6-2 and Example 6-3. Regular expressions have pattern modifiers that can be used to make multiline strings with their embedded newlines easy to use.

10.3.2.1 Pattern modifiers

The pattern modifiers we've used so far are /g, for global matching, and /i, for case-insensitive matching. Let's take a look at two more that affect the way regular expressions interact with the newlines in scalars.

Recall that previous regular expressions have used the caret (^), dot (.), and dollar sign ($) metacharacters. The ^ anchors a regular expression to the beginning of a string, by default, so that /^THE BEGUINE/ matches a string that begins with "THE BEGUINE". Similarly, $ anchors an expression to the end of the string, and the dot (.) matches any character except a newline.

The following pattern modifiers affect these three metacharacters:


  • The /s modifier assumes you want to treat the whole string as a single line, even with embedded newlines, so it makes the dot metacharacter match any character including newlines.

  • The /m modifier assumes you want to treat the whole string as a multiline, with embedded newlines, so it extends the ^ and the $ to match after, or before, a newline, embedded in the string.

10.3.2.2 Examples of pattern modifiers

Here's an example of the default behavior of caret (^), dot (.), and dollar sign ($):

use warnings;

"AAC\nGTT" =~ /^.*$/;

print $&, "\n";

This demonstrates the default behavior without the /m or /s modifiers and prints the warning:

Use of uninitialized value in print statement at line 3.

The print statement tries to print $& , a special variable that is always set to the last successful pattern match. This time, since the pattern doesn't match, the variable $& isn't set, and you get a warning message for attempting to print an uninitialized value.

Why doesn't the match succeed? First, let's examine the ^.*$ pattern. It begins with a ^, which means it must match from the beginning of the string. It ends with a $, which means it must also match at the end of the string (the end of the string may contain a single newline, but no other newlines are allowed). The .* means it must match zero or more (*) of any characters (.) except the newline. So, in other words, the pattern ^.*$ matches any string that doesn't contain a newline except for a possible single newline as the last character. But since the string in question, "ACC\nGTT" does contain an embedded newline \n that isn't the last character, the pattern match fails.

In the next examples, the pattern modifiers /m and /s change the default behaviors for the metacharacters ^, and $, and the dot:

"AAC\nGTT" =~ /^.*$/m;

print $&, "\n";

This snippet prints out AAC and demonstrates the /m modifier. The /m extends the meaning of the ^ and the $ so they also match around embedded newlines. Here, the pattern matches from the beginning of the string up to the first embedded newline.

The next snippet of code demonstrates the /s modifier:

"AAC\nGTT" =~ /^.*$/s;

print $&, "\n";

which produces the output:

AAC

GTT

The /s modifier changes the meaning of the dot metacharacter so that it matches any character including newlines. With the /s modifier, the pattern matches everything from the beginning of the string to the end of the string, including the newline. Notice when it prints, it prints the embedded newline.



10.3.2.3 Separating annotations from sequence

Now that you've met the pattern-matching modifiers and regular expressions that will be your main tools for parsing a GenBank file as a scalar, let's try separating the annotations from the sequence.

The first step is to get the GenBank record stored as a scalar variable. Recall that a GenBank record starts with a line beginning with the word "LOCUS" and ends with the end-of-record separator: a line containing two forward slashes.

First you want to read a GenBank record and store it in a scalar variable. There's a device called an input record separator denoted by the special variable $/ that lets you do exactly that. The input record separator is usually set to a newline, so each call to read a scalar from a filehandle gets one line. Set it to the GenBank end-of-record separator like so:

$/ = "//\n";

A call to read a scalar from a filehandle takes all the data up to the GenBank end-of-record separator. So the line $record = in Example 10-2 stores the multiline GenBank record into the scalar variable $record. Later, you'll see that you can keep repeating this call in order to read in successive GenBank records from a GenBank library file.

After reading in the record, you'll parse it into the annotation and sequence parts making use of /s and /m pattern modifiers. Extracting the annotation and sequence is the easy part; parsing the annotation will occupy most of the remainder of the chapter.

Example 10-2. Extract annotation and sequence from Genbank record

#!/usr/bin/perl

# Extract the annotation and sequence sections from the first

# record of a GenBank library


use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my $annotation = '';

my $dna = '';

my $record = '';

my $filename = 'record.gb';

my $save_input_separator = $/;


# Open GenBank library file

unless (open(GBFILE, $filename)) {

print "Cannot open GenBank file \"$filename\"\n\n";

exit;


}
# Set input separator to "//\n" and read in a record to a scalar

$/ = "//\n";


$record = ;
# reset input separator

$/ = $save_input_separator;


# Now separate the annotation from the sequence data

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);


# Print the two pieces, which should give us the same as the

# original GenBank file, minus the // at the end

print $annotation, $dna;

exit;

The output from this program is the same as the GenBank file listed previously, minus the last line, which is the end-of-record separator //.

Let's focus on the regular expression that parses the annotation and sequence out of the $record variable. This is the most complicated regular expression so far:

$record = /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s.

There are two pairs of parentheses in the regular expression: (LOCUS.*ORIGIN\s*\n) and (.*). The parentheses are metacharacters whose purpose is to remember the parts of the data that match the pattern within the parentheses, namely, the annotation and the sequence. Also note that the pattern match returns an array whose elements are the matched parenthetical patterns. After you match the annotation and the sequence within the pairs of parentheses in the regular expression, you simply assign the matched patterns to the two variables $annotation and $dna, like so:

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

Notice that at the end of the pattern, we've added the /s pattern matching modifier, which, as you've seen earlier, allows a dot to match any character including an embedded newline. (Of course, since we've got a whole GenBank record in the $record scalar, there are a lot of embedded newlines.)

Next, look at the first pair of parentheses:

(LOCUS.*ORIGIN\s*\n)

This whole expression is anchored at the beginning of the string by preceding it with a ^ metacharacter. (/s doesn't change the meaning of the ^ character in a regular expression.)

Inside the parentheses, you match from where the string LOCUS appears at the beginning of the GenBank record, followed by any number of characters including newlines with .*, followed by the string ORIGIN, followed by possibly some whitespace with \s*, followed by a newline \n. This matches the annotation part of the GenBank record.

Now, look at the second parentheses and the remainder:

(.*)\/\/\n

This is easier. The .* matches any character, including newlines because of the /s pattern modifier at the end of the pattern match. The parentheses are followed by the end-of-record line, //, including the newline at the end, with the slashes preceded by backslashes to show that you want to match them exactly. They're not delimiters of the pattern matching operator. The end result is the GenBank record with the annotation and the sequence separated into the variables $annotation and $sequence. Although the regular expression I used requires a bit of explanation, the attractive thing about this approach is that it took only one line of Perl code to extract both annotation and sequence.


10.4 Parsing Annotations

Now that you've successfully extracted the sequence, let's look at parsing the annotations of a GenBank file.

Looking at a GenBank record, it's interesting to think about how to extract the useful information. The FEATURES table is certainly a key part of the story. It has considerable structure: what should be preserved, and what is unnecessary? For instance, sometimes you just want to see if a word such as "endonuclease" appears anywhere in the record. For this, you just need a subroutine that searches for any regular expression in the annotation. Sometimes this is enough, but when detailed surgery is necessary, Perl has the necessary tools to make the operation successful.

10.4.1 Using Arrays

Example 10-3 parses a few pieces of information from the annotations in a GenBank file. It does this using the data in the form of an array.

Example 10-3. Parsing GenBank annotations using arrays

#!/usr/bin/perl

# Parsing GenBank annotations using arrays
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my @genbank = ( );

my $locus = '';

my $accession = '';

my $organism = '';
# Get GenBank file data

@genbank = get_file_data('record.gb');


# Let's start with something simple. Let's get some of the identifying

# information, let's say the locus and accession number (here the same

# thing) and the definition and the organism.

for my $line (@genbank) {

if($line =~ /^LOCUS/) {

$line =~ s/^LOCUS\s*//;

$locus = $line;

}elsif($line =~ /^ACCESSION/) {

$line =~ s/^ACCESSION\s*//;

$accession = $line;

}elsif($line =~ /^ ORGANISM/) {

$line =~ s/^\s*ORGANISM\s*//;

$organism = $line;

}

}

print "*** LOCUS ***\n";



print $locus;

print "*** ACCESSION ***\n";

print $accession;

print "*** ORGANISM ***\n";

print $organism;
exit;

Here's the output from Example 10-3:

*** LOCUS ***

AB031069 2487 bp mRNA PRI 27-MAY-2000

*** ACCESSION ***

AB031069


*** ORGANISM ***

Homo sapiens

Now let's slightly extend that program to handle the DEFINITION field. Notice that the DEFINITION field can extend over more than one line. To collect that field, use a trick you've already seen in Example 10-1: set a flag when you're in the "state" of collecting a definition. The flag variable is called, unsurprisingly, $flag.

Example 10-4. Parsing GenBank annotations using arrays, take 2

#!/usr/bin/perl

# Parsing GenBank annotations using arrays, take 2
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my @genbank = ( );

my $locus = '';

my $accession = '';

my $organism = '';

my $definition = '';

my $flag = 0;
# Get GenBank file data

@genbank = get_file_data('record.gb');


# Let's start with something simple. Let's get some of the identifying

# information, let's say the locus and accession number (here the same

# thing) and the definition and the organism.

for my $line (@genbank) {

if($line =~ /^LOCUS/) {

$line =~ s/^LOCUS\s*//;

$locus = $line;

}elsif($line =~ /^DEFINITION/) {

$line =~ s/^DEFINITION\s*//;

$definition = $line;

$flag = 1;

}elsif($line =~ /^ACCESSION/) {

$line =~ s/^ACCESSION\s*//;

$accession = $line;

$flag = 0;

}elsif($flag) {

chomp($definition);

$definition .= $line;

}elsif($line =~ /^ ORGANISM/) {

$line =~ s/^\s*ORGANISM\s*//;

$organism = $line;

}

}

print "*** LOCUS ***\n";



print $locus;

print "*** DEFINITION ***\n";

print $definition;

print "*** ACCESSION ***\n";

print $accession;

print "*** ORGANISM ***\n";

print $organism;
exit;

Example 10-4 outputs:

*** LOCUS ***

AB031069 2487 bp mRNA PRI 27-MAY-2000

*** DEFINITION ***

Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.

*** ACCESSION ***

AB031069

*** ORGANISM ***

Homo sapiens

This use of flags to remember which part of the file you're in, from one iteration of a loop to the next, is a common technique when extracting information from files that have multiline sections. As the files and their fields get more complex, the code must keep track of many flags at a time to remember which part of the file it's in and what information needs to be extracted. It works, but as the files become more complex, so does the code. It becomes hard to read and hard to modify. So let's look at regular expressions as a vehicle for parsing annotations.


10.4.2 When to Use Regular Expressions

We've used two methods to parse GenBank files: regular expressions and looping through arrays of lines and setting flags. We used both methods to separate the annotation from the sequence in a previous section of this chapter. Both methods were equally well suited, since in GenBank files, the annotation is followed by the sequence, clearly delimited by an ORIGIN line: a simple structure. However, parsing the annotations seems a bit more complicated; therefore, let's try to use regular expressions to accomplish the task.

To begin, let's wrap the code we've been working on into some convenient subroutines to focus on parsing the annotations. You'll want to fetch GenBank records one at a time from a library (a file containing one or more GenBank records), extract the annotations and the sequence, and then if desired parse the annotations. This would be useful if, say, you were looking for some motif in a GenBank library. Then you can search for the motif, and, if found, you can parse the annotations to look for additional information about the sequence.

As mentioned previously, we'll use the file library.gb, which you can download from this book's web site.

Since dealing with annotation data is somewhat complex, let's take a minute to break our tasks into convenient subroutines. Here's the pseudocode:

sub open_file

given the filename, return the filehandle

sub get_next_record

given the filehandle, get the record

(we can get the offset by first calling "tell")
sub get_annotation_and_dna

given a record, split it into annotation and cleaned-up sequence


sub search_sequence

given a sequence and a regular expression,

return array of locations of hits
sub search_annotation

given a GenBank annotation and a regular expression,

return array of locations of hits
sub parse_annotation

separate out the fields of the annotation in a convenient form

sub parse_features

given the features field, separate out the components

The idea is to make a subroutine for each important task you want to accomplish and then combine them into useful programs. Some of these can be combined into other subroutines: for instance, perhaps you want to open a file and get the record from it, all in one subroutine call.

You're designing these subroutines to work with library files, that is, files with multiple GenBank records. You pass the filehandle into the subroutines as an argument, so that your subroutines can access open library files as represented by the filehandles. Doing so enables you to have a get_next_record function, which is handy in a loop. Using the Perl function tell also allows you to save the byte offset of any record of interest, and then return later and extract the record at that byte offset very quickly. (A byte offset is just the number of characters into the file where the information of interest lies.) The operating system supports Perl in letting you go immediately to any byte offset location in even huge files, thus bypassing the usual way of opening the file and reading from the beginning until you get where you want to be.

Using a byte offset is important when you're dealing with large files. Perl gives you built-in functions such as seek that allow you, on an open file, to go immediately to any location in the file. The idea is that when you find something in a file, you can save the byte offset using the Perl function tell. Then, when you want to return to that point in the file, you can just call the Perl function seek with the byte offset as an argument. You'll see this later in this chapter when you build a DBM file to look up records based on their accession numbers. But the main point is that with a 250-MB file, it takes too long to find something by searching from the beginning, and there are ways of getting around it.

The parsing of the data is done in three steps, according to the design:


  1. You'll separate out the annotation and the sequence (which you'll clean up by removing whitespace, etc., and making it a simple string of sequence). Even at this step, you can search for motifs in the sequence, as well as look for text in the annotation.

  2. Extract out the fields.

  3. Parse the features table.

These steps seem natural, and, depending on what you want to do, allow you to parse to whatever depth is needed.

Here's a main program in pseudocode that shows how to use those subroutines:

open_file
while ( get_next_record )
get_annotation_and_dna
if ( search_sequence for a motif AND

search_annotation for chromosome 22 )


parse_annotation
parse_features to get sizes of exons, look for small sizes

}

}


return accession numbers of records meeting the criteria

This example shows how to use subroutines to answer a question such as: what are the genes on chromosome 22 that contain a given motif and have small exons?



10.4.3 Main Program

Let's test these subroutines with Example 10-5, which has some subroutine definitions that will be added to the BeginPerlBioinfo.pm module:



Example 10-5. GenBank library subroutines

#!/usr/bin/perl

# - test program of GenBank library subroutines
use strict;

use warnings;

# Don't use BeginPerlBioinfo

# Since all subroutines defined in this file

# use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare and initialize variables

my $fh; # variable to store filehandle

my $record;

my $dna;


my $annotation;

my $offset;

my $library = 'library.gb';
# Perform some standard subroutines for test

$fh = open_file($library);


$offset = tell($fh);
while( $record = get_next_record($fh) ) {
($annotation, $dna) = get_annotation_and_dna($record);
if( search_sequence($dna, 'AAA[CG].')) {

print "Sequence found in record at offset $offset\n";

}

if( search_annotation($annotation, 'homo sapiens')) {



print "Annotation found in record at offset $offset\n";

}
$offset = tell($fh);

}
exit;
################################################################################

# Subroutines

################################################################################
# open_file

#

# - given filename, set filehandle


sub open_file {
my($filename) = @_;

my $fh;
unless(open($fh, $filename)) {

print "Cannot open file $filename\n";

exit;


}

return $fh;

}
# get_next_record

#

# - given GenBank record, get annotation and DNA


sub get_next_record {
my($fh) = @_;
my($offset);

my($record) = '';

my($save_input_separator) = $/;
$/ = "//\n";
$record = <$fh>;
$/ = $save_input_separator;
return $record;

}
# get_annotation_and_dna

#

# - given GenBank record, get annotation and DNA


sub get_annotation_and_dna {
my($record) = @_;
my($annotation) = '';

my($dna) = '';


# Now separate the annotation from the sequence data

($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

# clean the sequence of any whitespace or / characters

# (the / has to be written \/ in the character class, because

# / is a metacharacter, so it must be "escaped" with \)

$dna =~ s/[\s\/]//g;

return($annotation, $dna)

}
# search_sequence

#

# - search sequence with regular expression


sub search_sequence {
my($sequence, $regularexpression) = @_;
my(@locations) = ( );
while( $sequence =~ /$regularexpression/ig ) {

push( @locations, pos );

}
return (@locations);

}
# search_annotation

#

# - search annotation with regular expression


sub search_annotation {
my($annotation, $regularexpression) = @_;
my(@locations) = ( );
# note the /s modifier--. matches any character including newline

while( $annotation =~ /$regularexpression/isg ) {

push( @locations, pos );

}
return (@locations);

}

Example 10-5 generates the following output on our little GenBank library:

Sequence found in record at offset 0

Annotation found in record at offset 0

Sequence found in record at offset 6256

Annotation found in record at offset 6256

Sequence found in record at offset 12366

Annotation found in record at offset 12366

Sequence found in record at offset 17730

Annotation found in record at offset 17730

Sequence found in record at offset 22340

Annotation found in record at offset 22340

The tell function reports the byte offset of the file up to the point where it's been read; so you want to first call tell and then read the record to get the proper offset associated with the beginning of the record.


10.4.4 Parsing Annotations at the Top Level

Now let's parse the annotations.

There is a document from NCBI we mentioned earlier that gives the details of the structure of a GenBank record. This file is gbrel.txt and is part of the GenBank release, available at the NCBI web site and their FTP site. It's updated with every release (every two months at present), and it includes notices of changes to the format. If you program with GenBank records, you should read this document and keep a copy around for reference use, and check periodically for announced changes in the GenBank record format.

If you look back at the complete GenBank record earlier in this chapter, you'll see that the annotations have a certain structure. You have some fields, such as LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, REFERENCE, FEATURES, and BASE COUNT that start at the beginning of a line. Some of these fields have subfields, especially the FEATURES field, which has a fairly complex structure.

But for now, let's just extract the top-level fields. You will need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.

Here's a regular expression that matches our definition of a field:

/^[A-Z].*\n(^\s.*\n)*/m

What does this regular expression say? First of all, it has the /m pattern matching modifier, which means the caret ^ and the dollar sign $ also match around embedded newlines (not just at the beginning and end of the entire string, which is the default behavior).

The first part of the regular expression:

^[A-Z].*\n

matches a capital letter at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. That's a good description of the first lines of the fields you're trying to match.

The second part of the regular expression:

(^\s.*\n)*

matches a space or tab \s at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. This is surrounded by parentheses and followed by a *, which means 0 or more such lines. This matches succeeding lines in a field, lines that start with whitespace. A field may have no extra lines of this sort or several such lines.

So, the two parts of the regular expression combined match the fields with their optional additional lines.

Example 10-6 shows a subroutine that, given the annotations section of a GenBank record stored in a scalar variable, returns a hash with keys equal to the names of the top-level fields and values equal to the contents of those fields.



1   ...   16   17   18   19   20   21   22   23   ...   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page