Beginning Perl for Bioinformatics


Example 10-6. Parsing Genbank annotation



Download 1.4 Mb.
Page21/28
Date conversion29.03.2017
Size1.4 Mb.
1   ...   17   18   19   20   21   22   23   24   ...   28

Example 10-6. Parsing Genbank annotation

#!/usr/bin/perl

# - test program for parse_annotation subroutine
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my $fh;


my $record;

my $dna;


my $annotation;

my %fields;

my $library = 'library.gb';
# Open library and read a record

$fh = open_file($library);


$record = get_next_record($fh);
# Parse the sequence and annotation

($annotation, $dna) = get_annotation_and_dna($record);


# Extract the fields of the annotation

%fields = parse_annotation($annotation);


# Print the fields

foreach my $key (keys %fields) {

print "******** $key *********\n";

print $fields{$key};

}
exit;
################################################################################

# Subroutine

################################################################################
# parse_annotation

#

# given a GenBank annotation, returns a hash with



# keys: the field names

# values: the fields


sub parse_annotation {
my($annotation) = @_;

my(%results) = ( );


while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {

my $value = $&;

(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

$results{$key} = $value;

}
return %results;

}

In the subroutine parse_annotation, note how the variables $key and $value are scoped within the while block. One benefit of this is that you don't have to reinitialize the variables each time through the loop. Also note that the key is the name of the field, and the value is the whole field.


You should take the time to understand the regular expression that extracts the field name for the key:

(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

This first assigns $key the value $value. It then replaces everything in $key (note the /s modifier for embedded newlines) with $1, which is a special variable pattern between the first pair of parentheses ([A-Z]+). This pattern is one or more capital letters (anchored to the beginning of the string, i.e., the field name), so it sets $key to the value of the first word in the field name.

You get the following output from Example 10-6 (the test just fetches the first record in the GenBank library):

******** SOURCE *********

SOURCE Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to

mRNA.

ORGANISM Homo sapiens


Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;

Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

******** DEFINITION *********

DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,

complete cds.

******** KEYWORDS *********

KEYWORDS .

******** VERSION *********

VERSION AB031069.1 GI:8100074

******** FEATURES *********

FEATURES Location/Qualifiers

source 1..2487

/organism="Homo sapiens"

/db_xref="taxon:9606"

/sex="male"

/cell_line="HuS-L12"

/cell_type="lung fibroblast"

/dev_stage="embryo"

gene 229..2199

/gene="PCCX1"

CDS 229..2199

/gene="PCCX1"

/note="a nuclear protein carrying a PHD finger and a CXXC

domain"

/codon_start=1

/product="protein containing CXXC domain 1"

/protein_id="BAA96307.1"

/db_xref="GI:8100075"

/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD

NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP

RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ

QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY

FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP

EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE

KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS

DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR

FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK

YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC

PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT

AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"

******** REFERENCE *********

REFERENCE 2 (bases 1 to 2487)

AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and

Takano,T.

TITLE Direct Submission

JOURNAL Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.

Tadahiro Fujino, Keio University School of Medicine, Department of

Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan

(E-mail:fujino@microb.med.keio.ac.jp,

Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)

******** ACCESSION *********

ACCESSION AB031069

******** LOCUS *********

LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000

******** ORIGIN *********

ORIGIN


******** BASE *********

BASE COUNT 564 a 715 c 768 g 440 t

As you see, the method is working, and apart from the difficulty of reading the regular expressions (which will become easier with practice), the code is very straightforward, just a few short subroutines.

10.4.5 Parsing the FEATURES Table

Let's take this one step further and parse the features table to its next level, composed of the source , gene, and CDS features keys. (See later in this section for a more complete list of these features keys.) In the exercises at the end of the chapter, you'll be challenged to descend further into the FEATURES table.

To study the FEATURES table, you should first look over the NCBI gbrel.txt document mentioned previously. Then you should study the most complete documentation for the FEATURES table, available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html.

10.4.5.1 Features

Although our GenBank entry is fairly simple and includes only three features, there are actually quite a few of them. Notice that the parsing code will find all of them, because it's just looking at the structure of the document, not for specific features.

The following is a list of the features defined for GenBank records. Although lengthy, I think it's important to read through it to get an idea of the range of information that may be present in a GenBank record.

allele

Obsolete; see variation feature key



attenuator

Sequence related to transcription termination



C_region

Span of the C immunological feature


CAAT_signal

CAAT box in eukaryotic promoters


CDS

Sequence coding for amino acids in protein (includes stop codon)



conflict

Independent sequence determinations differ



D-loop

Displacement loop



D_segment

Span of the D immunological feature



enhancer

Cis-acting enhancer of promoter function



exon

Region that codes for part of spliced mRNA



gene

Region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned



GC_signal

GC box in eukaryotic promoters



iDNA

Intervening DNA eliminated by recombination



intron

Transcribed region excised by mRNA splicing



J_region

Span of the J immunological feature



LTR

Long terminal repeat



mat_peptide

Mature peptide coding region (doesn't include stop codon)



misc_binding

Miscellaneous binding site



misc_difference

Miscellaneous difference feature



misc_feature

Region of biological significance that can't be described by any other feature



misc_recomb

Miscellaneous recombination feature



misc_RNA

Miscellaneous transcript feature not defined by other RNA keys



misc_signal

Miscellaneous signal



misc_structure

Miscellaneous DNA or RNA structure



modified_base

The indicated base is a modified nucleotide



mRNA

Messenger RNA



mutation

Obsolete: see variation feature key

N_region

Span of the N immunological feature



old_sequence

Presented sequence revises a previous version



polyA_signal

Signal for cleavage and polyadenylation



polyA_site

Site at which polyadenine is added to mRNA



precursor_RNA

Any RNA species that isn't yet the mature RNA product



prim_transcript

Primary (unprocessed) transcript



primer

Primer binding region used with PCR



primer_bind

Noncovalent primer binding site



promoter

A region involved in transcription initiation



protein_bind

Noncovalent protein binding site on DNA or RNA



RBS

Ribosome binding site



rep_origin

Replication origin for duplex DNA



repeat_region

Sequence containing repeated subsequences



repeat_unit

One repeated unit of a repeat_region



rRNA

Ribosomal RNA



S_region

Span of the S immunological feature



satellite

Satellite repeated sequence



scRNA

Small cytoplasmic RNA



sig_peptide

Signal peptide coding region



snRNA

Small nuclear RNA


source

Biological source of the sequence data represented by a GenBank record; mandatory feature, one or more per record; for organisms that have been incorporated within the NCBI taxonomy database, an associated /db_xref="taxon:NNNN" qualifier will be present (where NNNNN is the numeric identifier assigned to the organism within the NCBI taxonomy database)


stem_loop

Hairpin loop structure in DNA or RNA



STS

Sequence Tagged Site: operationally unique sequence that identifies the combination of primer spans used in a PCR assay



TATA_signal

TATA box in eukaryotic promoters



terminator

Sequence causing transcription termination



transit_peptide

Transit peptide coding region



transposon

Transposable element (TN)



tRNA

Transfer RNA



unsure

Authors are unsure about the sequence in this region



V_region

Span of the V immunological feature



variation

A related population contains stable mutation



-

Placeholder (hyphen)



-10_signal

Pribnow box in prokaryotic promoters



-35_signal

-35 box in prokaryotic promoters



3'clip

3'-most region of a precursor transcript removed in processing



3'UTR

3' untranslated region (trailer)



5'clip

5'-most region of a precursor transcript removed in processing



5'UTR

5' untranslated region (leader)

These feature keys can have their own additional features, which you'll see here and in the exercises.

10.4.5.2 Parsing

Example 10-7 finds whatever features are present and returns an array populated with them. It doesn't look for the complete list of features as presented in the last section; it finds just the features that are actually present in the GenBank record and returns them for further use.

It's often the case that there are multiple instances of the same feature in a record. For instance, there may be several exons specified in the FEATURES table of a GenBank record. For this reason we'll store the features as elements in an array, rather than in a hash keyed on the feature name (as this allows you to store, for instance, only one instance of an exon).


Example 10-7. Testing subroutine parse_features

#!/usr/bin/perl

# - main program to test parse_features
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my $fh;


my $record;

my $dna;


my $annotation;

my %fields;

my @features;

my $library = 'library.gb';


# Get the fields from the first GenBank record in a library

$fh = open_file($library);


$record = get_next_record($fh);
($annotation, $dna) = get_annotation_and_dna($record);
%fields = parse_annotation($annotation);
# Extract the features from the FEATURES table

@features = parse_features($fields{'FEATURES'});


# Print out the features

foreach my $feature (@features) {


# extract the name of the feature (or "feature key")

my($featurename) = ($feature =~ /^ {5}(\S+)/);


print "******** $featurename *********\n";

print $feature;

}
exit;
############################################################################

# Subroutine

############################################################################
# parse_features

#

# extract the features from the FEATURES field of a GenBank record


sub parse_features {
my($features) = @_; # entire FEATURES field in a scalar variable
# Declare and initialize variables

my(@features) = (); # used to store the individual features


# Extract the features

while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {

my $feature = $&;

push(@features, $feature);

}
return @features;

}

Example 10-7 gives the output:

******** source *********

source 1..2487

/organism="Homo sapiens"

/db_xref="taxon:9606"

/sex="male"

/cell_line="HuS-L12"

/cell_type="lung fibroblast"

/dev_stage="embryo"

******** gene *********

gene 229..2199

/gene="PCCX1"

******** CDS *********

CDS 229..2199

/gene="PCCX1"

/note="a nuclear protein carrying a PHD finger and a CXXC

domain"


/codon_start=1

/product="protein containing CXXC domain 1"

/protein_id="BAA96307.1"

/db_xref="GI:8100075"

/translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD

NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP

RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ

QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY

FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP

EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE

KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS

DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR

FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK

YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC

PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT

AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR

In subroutine parse_features of Example 10-7, the regular expression that extracts the features is much like the regular expression used in Example 10-6 to parse the top level of the annotations. Let's look at the essential parsing code of Example 10-7:

while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {

On the whole, and in brief, this regular expression finds features formatted with the first lines beginning with 5 spaces, and optional continuation lines beginning with 21 spaces.

First, note that the pattern modifier /m enables the ^ metacharacter to match after embedded newlines. Also, the {5} and {21} are quantifiers that specify there should be exactly 5, or 21, of the preceding item, which in both cases is a space.

The regular expression is in two parts, corresponding to the first line and optional continuation lines of the feature. The first part ^ {5}\S.*\n means that the beginning of a line (^) has 5 spaces ({5}), followed by a non-whitespace character (\S) followed by any number of non-newlines (.*) followed by a newline (\n). The second part of the regular expression, (^ {21}\S.*\n)* means the beginning of a line (^) has 21 spaces ({21}) followed by a non-whitespace character (\S) followed by any number of non-newlines (.*) followed by a newline (\n); and there may be 0 or more such lines, indicated by the ()* around the whole expression.

The main program has a short regular expression along similar lines to extract the feature name (also called the feature key) from the feature.

So, again, success. The FEATURES table is now decomposed or "parsed" in some detail, down to the level of separating the individual features. The next stage in parsing the FEATURES table is to extract the detailed information for each feature. This includes the location (given on the same line as the feature name, and possibly on additional lines); and the qualifiers indicated by a slash, a qualifier name, and if applicable, an equals sign and additional information of various kinds, possibly continued on additional lines.

I'll leave this final step for the exercises. It's a fairly straightforward extension of the approach we've been using to parse the features. You will want to consult the documentation from the NCBI web site for complete details about the structure of the FEATURES table before trying to parse the location and qualifiers from a feature.

The method I've used to parse the FEATURES table maintains the structure of the information. However, sometimes you just want to see if some word such as "endonulease" appears anywhere in the record. For this, recall that you created a search_annotation subroutine in Example 10-5 that searches for any regular expression in the entire annotation; very often, this is all you really need. As you've now seen, however, when you really need to dig into the FEATURES table, Perl has its own features that make the job possible and even fairly easy.

10.5 Indexing GenBank with DBM

DBM stands for Database Management. Perl provides a set of built-in functions that give Perl programmers access to DBM files.



10.5.1 DBM Essentials

When you open a DBM file, you access it like a hash: you give it keys and it returns values, and you can add and delete key-value pairs. What's useful about DBM is that it saves the key-value data in a permanent disk file on your computer. It can thus save information between the times you run your program; it can also serve as a way to share information between different programs that need the same data. A DBM file can get very big without killing the main memory on your computer and making your program—and everything else—slow to a crawl.

There are two functions, dbmopen and dbmclose, that "tie" a hash to a DBM file; then you just use the hash. As you've seen, with a hash, lookups are easy, as are definitions. You can get a list of all the keys from a hash called %my_hash by typing keys %my_hash. You then can get a list of all values by typing values %my_hash. For large DBM files, you may not want to do this; the Perl function each allows you to read key-value pairs one at a time, thus saving the memory of your running program. There is also a delete function to remove the definitions of keys:

delete $my_hash{'DNA'}

entirely removes that key from the hash.

DBM files are a very simple database. They don't have the power of a relational database such as MySQL , Oracle, or PostgreSQL ; however, it's remarkable how often a simple database is all that a problem really needs. When you have a set of key-value data (or several such sets), consider using DBM. It's certainly easy to use with Perl.

The main wrinkle to using DBM is that there are several, slightly different DBM implementations—NDBM, GDBM, SDBM, and Berkeley DB. The differences are small but real; but for most purposes, the implementations are interchangeable. Newer versions of Perl give you Berkeley DB by default, and it's easy to get it and install it for your Perl if you want. If you don't have really long keys or values, it's not a problem. Some older DBMs require you to add null bytes to keys and delete them from values:

$value = $my_hash{"$key\0"};

chop $value;

Chances are good that you won't have to do that. Berkeley DB handles long strings well (some of the other DBM implementations have limits), and because you have some potentially long strings in biology, I recommend installing Berkeley DB if you don't have it.


10.5.2 A DBM Database for GenBank

You've seen how to extract information from a GenBank record or from a library of GenBank records. You've just seen how DBM files can save your hash data on your hard disk between program runs. You've also seen the use of tell and seek to quickly access a location in a file.

Now let's combine the three ideas and use DBM to build a database of information about a GenBank library. It'll be something simple: you'll extract the accession numbers for the keys and store the byte offsets in the GenBank library of records for the values. You'll add some code that, given a library and an offset, returns the record at that offset, and write a main program that allows the user to interactively request GenBank records by accession number. When complete, your program should very quickly return a GenBank record if given its accession number.

This general idea is extended in the exercises at the end of the chapter to a considerable extent; you may want to glance ahead at them now to get an idea of the potential power of the technique I'm about to present.

With just the appropriate amount of further ado, here is a code fragment that opens (creating if necessary) a DBM file:

unless(dbmopen(%my_hash, 'DBNAME', 0644)) {


print "Cannot open DBM file DBNAME with mode 0644\n";

exit;

}

%my_hash is like any other hash in Perl, but it will be tied to the DBM file with this statement. DBNAME is the basename of the actual DBM files that will be created. Some DBM versions create one file of exactly that name; others create two files with file extensions .dir and .pag.

Another parameter is called the mode. Unix or Linux users will be familiar with file permissions in this form. Many possibilities exist; here are the most common ones:


0644

You can read and write; others can just read.



0600

Only you can read or write.



0666

Anyone can read or write.



0444

Anyone can read (nobody can write).



0400

Only you can read (nobody else can do anything).

The dbmopen call fails if you try to open a file with a mode that assumes there are more permissions than were conferred on the DBM file when it was created. Usually, the mode 0644 is declared by the owner if only the owner should be allowed to write, and 0444 is declared by readers. Mode 0666 is declared by the owner and others if the file is meant to be read or written by anyone.

That's pretty much it; DBM files are that simple. Example 10-8 displays a DBM file that stores key-value pairs of accession numbers of GenBank records for keys, and byte offsets of the records as values.



1   ...   17   18   19   20   21   22   23   24   ...   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page