Beginning Perl for Bioinformatics


Example 10-8. A DBM index of a GenBank library



Download 1.4 Mb.
Page22/28
Date conversion29.03.2017
Size1.4 Mb.
1   ...   18   19   20   21   22   23   24   25   ...   28

Example 10-8. A DBM index of a GenBank library

#!/usr/bin/perl

# - make a DBM index of a GenBank library,

# and demonstrate its use interactively


use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my $fh;


my $record;

my $dna;


my $annotation;

my %fields;

my %dbm;

my $answer;

my $offset;

my $library = 'library.gb';


# open DBM file, creating if necessary

unless(dbmopen(%dbm, 'GB', 0644)) {

print "Cannot open DBM file GB with mode 0644\n";

exit;


}
# Parse GenBank library, saving accession number and offset in DBM file

$fh = open_file($library);


$offset = tell($fh);
while ( $record = get_next_record($fh) ) {
# Get accession field for this record.

($annotation, $dna) = get_annotation_and_dna($record);


%fields = parse_annotation($annotation);
my $accession = $fields{'ACCESSION'};
# extract just the accession number from the accession field

# --remove any trailing spaces

$accession =~ s/^ACCESSION\s*//;
$accession =~ s/\s*$//;
# store the key/value of accession/offset

$dbm{$accession} = $offset;


# get offset for next record

$offset = tell($fh);

}
# Now interactively query the DBM database with accession numbers

# to see associated records


print "Here are the available accession numbers:\n";
print join ( "\n", keys %dbm ), "\n";
print "Enter accession number (or quit): ";

while( $answer = ) {

chomp $answer;

if($answer =~ /^\s*q/) {

last;


}

$offset = $dbm{$answer};


if ($offset) {

seek($fh, $offset, 0);

$record = get_next_record($fh);

print $record;

}else{

print "Do not have an entry for accession number $answer\n";



}
print "\nEnter accession number (or quit): ";

}
dbmclose(%dbm);


close($fh);
exit;

Here's the truncated output of Example 10-8:

Here are the available accession numbers:
XM_006271

NM_021964

XM_009873

AB031069


XM_006269
Enter accession number (or quit): NM_021964
LOCUS NM_021964 3032 bp mRNA PRI 14-MAR-2001

DEFINITION Homo sapiens zinc finger protein 148 (pHZ-52) (ZNF148), mRNA.

...

//
Enter accession number (or quit): q


10.6 Exercises

Exercise 10.1

Go to the NCBI, EMBL, and EBI web sites and become familiar with their use.



Exercise 10.2

Read the GenBank format documentation, gbrel.txt.



Exercise 10.3

Write a subroutine that passes a hash by value. Now rewrite it to pass the hash by reference.


Exercise 10.4

Design a module of subroutines to handle the following kinds of data: a flat file containing records consisting of gene names on a line and extra information of any sort on succeeding lines, followed by a blank line. Your subroutines should be able to read in the data and then do a fast lookup on the information associated with a gene name. You should also be able to add new records to the flat file. Now reuse this module to build an address book program.


Exercise 10.5

Descend further into the FEATURES table. Parse the features in the table into their next level by parsing the feature names, locations, and qualifiers. Check the document gbrel.txt for definitions of the structures of the fields.



Exercise 10.6

Write a program that takes a long DNA sequence as input and outputs the counts of all four-base subsequences (256 of them in all), sorted by frequency. A four-base subsequence starts at each location 1, 2, 3, and so on. (This kind of word-frequency analysis is common to many fields of study, including linguistics, computer science, and music.)



Exercise 10.7

Extend the program in Exercise 10.6 to count all the sequences in a GenBank library.



Exercise 10.8

Given an amino acid, find the frequency of occurrence of the adjacent amino acids coded in a DNA sequence; or in a GenBank library.



Exercise 10.10

Extract all the words (excluding words like "the" or other unnecessary words) from the annotation of a library of GenBank records. For each word found, add the offset of the GenBank record in the library to a DBM file that has keys equal to the words, and values that are strings with offsets separated by spaces. In other words, one key can have a space-separated list of offsets for a value. Then you can quickly find all records containing a word like "fibroblast" with a simple lookup, followed by extracting the offsets and seeking into the library with those offsets. How big is your DBM file compared to the GenBank library? What might be involved in constructing a search engine for the annotations in all of GenBank? For human DNA only?


Exercise 10.10

Write a program to make a custom library of oncogenes from the GBPRI division of GenBank.


Chapter 11. Protein Data Bank
The success of the Human Genome Project in decoding the DNA sequence of human genes has captured the public imagination, but another project has been quietly gaining momentum, and it promises equally revolutionary results. This project is an international effort to determine the 3D structure of a comprehensive range of proteins on a genome-wide level using high-throughput analytical technologies. This international effort is the foundation of the new field of structural genomics.

Recent and expected advances in technology promise an accelerating pace of protein structure determination. The storehouse for all of this data is the Protein Data Bank (PDB). The PDB may be found on the web at http://www.rcsb.org/pdb/.

Finding the amino acid or primary sequence is just the beginning of studying a protein. Proteins fold locally into secondary structures such as alpha helices, beta-strands, and turns. Two or three adjacent secondary structures might combine into common local folds called " motifs" or "supersecondary" structures such as beta sheets or alpha-alpha units. These building blocks then fold into the 3D or tertiary structure of a protein. Finally, one or more tertiary structures may be combined as subunits into a quaternary structure such as an enzyme or a virus.

Without knowing how a protein folds into a 3D structure, you are less likely to know what the protein does or how it does it. Even if you know that the protein is implicated in a disease, knowledge of its tertiary structure is usually needed to find a possible treatment. Knowing the tertiary conformation of the active site of a protein (which may involve amino acids that are far apart in terms of the primary sequence but which are brought together by the folding of the protein) is critical to guide the selection of targets for new drugs.

Now that the basic genetic information of a number of organisms, including humans, has been decoded, a primary challenge facing biologists is to learn as much as possible about the proteins those genes produce and how they interact.

In fact, one of the great questions of modern biology is how the primary amino acid sequence of a protein determines its ultimate 3D shape. If a computational method can be found to reliably predict the fold of a protein from its amino acid sequence, the effect on biology and medicine would be profound.

In this chapter, you'll learn the basics of PDB files and how to parse out selected information form them. You'll also explore interesting Perl techniques for finding and iterating over lots of files, as well as controlling other bioinformatics programs from a Perl program. The exercises at the end of the chapter challenge you to extend the introductory material presented here to gain access to more of the PDB data.

11.1 Overview of PDB

The main source for information about 3D structures of macromolecules (including proteins, peptides, viruses, protein/nucleic acid complexes, nucleic acids, and carbohydrates) is PDB, and its format is the de facto standard for the exchange of structural information. Most of these structures are determined experimentally by means of X-ray diffraction or nuclear magnetic resonance (NMR) studies.

PDB started in 1971 with seven proteins; it will soon grow to 20,000 structures. With the international effort in structural genomics increasing, the PDB is certain to continue its rapid growth. Within a few short years the number of known structures will approach 100,000.

PDB files are like GenBank records, in that they are human-readable ASCII flat files. The text conforms to a specific format, so computer programs may be written to extract the information. PDB is organized with one structure per file, unlike Genbank, which is distributed with many records in each "library" file.

Bioinformaticians who work extensively with PDB files report that there are serious problems with the consistency of the PDB format. For instance, as the field has advanced and the data format has evolved to meet new knowledge requirements, some of the older files have become out of date, and efforts are underway to address the uniformity of PDB data. Until these efforts are complete and a new data format is developed, inconsistencies in the current data format are a challenge programmers have to face. If you do a lot of programming with PDB files, you'll find many inconsistencies and errors in the data, especially in the older files. Plus, many parsing tools that work well on newer files perform poorly on older files.

As you become a more experienced programmer, these and other issues the PDB faces become more important. For instance, as PDB evolves, the code you write to interact with it must also evolve; you must always maintain your code with an eye on how the rest of the world is changing. As links between databases become better supported, your code will take advantage of the new opportunities the links provide. With new standards of data storage becoming established, your code will have to evolve to include them.

The PDB web site contains a wealth of information on how to download all the files. They are also conveniently distributed—and at no cost—on a set of CDs, which is a real advantage for those lacking high-throughput Internet connections.


11.2 Files and Folders

The PDB is distributed as files within directories. Each protein structure occupies its own file. PDB contains a huge amount of data, and it can be a challenge to deal with it. So in this section, you'll learn to deal with large numbers of files organized in directories and subdirectories.

You'll frequently find a need to write programs that manipulate large numbers of files. For example: perhaps you keep all your sequencing runs in a directory, organized into subdirectories labeled by the dates of the sequencing runs and containing whatever the sequencer produced on those days. After a few years, you could have quite a number of files.

Then, one day you discover a new sequence of DNA that seems to be implicated in cell division. You do a BLAST search (see Chapter 12) but find no significant hits for your new DNA. At that point you want to know whether you've seen this DNA before in any previous sequencing runs.[1] What you need to do is run a comparison subroutine on each of the hundreds or thousands of files in all your various sequencing run subdirectories. But that's going to take several days of repetitive, boring work sitting at the computer screen.



[1] You may do a comparison by keeping copies of all your sequencing runs in one large BLAST library; building such a BLAST library can be done using the techniques shown in this section.

You can write a program in much less time than that! Then all you have to do is sit back and examine the results of any significant matches your program finds. To write the program, however, you have to know how to manipulate all the files and folders in Perl. The following sections show you how to do it.


11.2.1 Opening Directories

A filesystem is organized in a tree structure. The metaphor is apt. Starting from anyplace on the tree, you can proceed up the branches and get to any leaves that stem from your starting place. If you start from the root of the tree, you can reach all the leaves. Similarly, in a filesystem, if you start at a certain directory, you can reach all the files in all the subdirectories that stem from your starting place, and if you start at the root (which, strangely enough, is also called the "top") of the filesystem, you can reach all the files.

You've already had plenty of practice opening, reading from, writing to, and closing files. I will show a simple method with which you can open a folder (also called a directory) and get the filenames of all the files in that folder. Following that, you'll see how to get the names of all files from all directories and subdirectories from a certain starting point.

Let's look at the Perlish way to list all the files in a folder, beginning with some pseudocode:

open folder

read contents of folder (files and subfolders)
print their names

Example 11-1 shows the actual Perl code.

Example 11-1. Listing the contents of a folder (or directory)

#!/usr/bin/perl

# Demonstrating how to open a folder and list its contents
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
my @files = ( );

my $folder = 'pdb';


# open the folder

unless(opendir(FOLDER, $folder)) {

print "Cannot open folder $folder!\n";

exit;


}
# read the contents of the folder (i.e. the files and subfolders)

@files = readdir(FOLDER);


# close the folder

closedir(FOLDER);


# print them out, one per line

print join( "\n", @files), "\n";


exit;

Since you're running this program on a folder that contains PDB files, this is what you'll see:

.

..

3c



44

pdb1a4o.ent

If you want to list the files in the current directory, you can give the directory name the special name "." for the current directory, like so:

my $folder = '.';

On Unix or Linux systems, the special files "." and ".." refer to the current directory and the parent directory, respectively. These aren't "really" files, at least not files you'd want to read; you can avoid listing them with the wonderful and amazing grep function. grep allows you to select elements from an array based on a test, such as a regular expression. Here's how to filter out the array entries "." and "..":

@files = grep( !/^\.\.?$/, @files);


grep selects all lines that don't match the regular expression, due to the negation operator written as the exclamation mark. The regular expression /^\.\.?$/ is looking for a line that begins with (the beginning of a line is indicated with the ^ metacharacter) a period \. (escaped with a backslash since a period is a metacharacter) followed by 0 or 1 periods \.? (the ? matches 0 or 1 of the preceding items), and nothing more (indicated by the $ end-of-string metacharacter).

In fact, this is so often used when reading a directory that it's usually combined into one step:

@files = grep (!/^\.\.?$/, readdir(FOLDER));

Okay, now all the files are listed. But wait: what if some of these files aren't files at all but are subfolders? You can use the handy file test operators to test each filename and then even open each subfolder and list the files in them. First, some pseudocode:

open folder

for each item in the folder
if it's a file

print its name


else if it's a folder

open the folder

print the names of the contents of the folder

}

}



Example 11-2 shows the program.

Example 11-2. List contents of a folder and its subfolders

#!/usr/bin/perl

# Demonstrating how to open a folder and list its contents

# --distinguishing between files and subfolders, which

# are themselves listed
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
my @files = ( );

my $folder = 'pdb';


# Open the folder

unless(opendir(FOLDER, $folder)) {

print "Cannot open folder $folder!\n";

exit;


}
# Read the folder, ignoring special entries "." and ".."

@files = grep (!/^\.\.?$/, readdir(FOLDER));


closedir(FOLDER);
# If file, print its name

# If folder, print its name and contents

#

# Notice that we need to prepend the folder name!



foreach my $file (@files) {
# If the folder entry is a regular file

if (-f "$folder/$file") {

print "$folder/$file\n";
# If the folder entry is a subfolder

}elsif( -d "$folder/$file") {


my $folder = "$folder/$file";
# open the subfolder and list its contents

unless(opendir(FOLDER, "$folder")) {

print "Cannot open folder $folder!\n";

exit;


}

my @files = grep (!/^\.\.?$/, readdir(FOLDER));

closedir(FOLDER);

foreach my $file (@files) {

print "$folder/$file\n";

}

}



}

exit;

Here's the output of Example 11-2:

pdb/3c/pdb43c9.ent

pdb/3c/pdb43ca.ent

pdb/44/pdb144d.ent

pdb/44/pdb144l.ent

pdb/44/pdb244d.ent

pdb/44/pdb244l.ent

pdb/44/pdb344d.ent

pdb/44/pdb444d.ent

pdb/pdb1a4o.ent

Notice how variable names such as $file and @files have been reused in this code, using lexical scoping in the inner blocks with my. If the overall structure of the program wasn't so short and simple, this could get really hard to read. When the program says $file, does it mean this $file or that $file? This code is an example of how to get into trouble. It works, but it's hard to read, despite its brevity.

In fact, there's a deeper problem with Example 11-2. It's not well designed. By extending Example 11-1, it can now list subdirectories. But what if there are further levels of subdirectories?


11.2.2 Recursion

If you have a subroutine that lists the contents of directories and recursively calls itself to list the contents of any subdirectories it finds, you can call it on the top-level directory, and it eventually lists all the files.

Let's write another program that does just that. A recursive subroutine is defined simply as a subroutine that calls itself. Here is the pseudocode and the code (Example 11-3) followed by a discussion of how recursion works:

subroutine list_recursively


open folder
for each item in the folder
if it's a file

print its name


else if it's a folder

list_recursively

}

}

Example 11-3. A recursive subroutine to list a filesystem



#!/usr/bin/perl

# Demonstrate a recursive subroutine to list a subtree of a filesystem


use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
list_recursively('pdb');
exit;
################################################################################

# Subroutine

################################################################################
# list_recursively

#

# list the contents of a directory,



# recursively listing the contents of any subdirectories
sub list_recursively {
my($directory) = @_;
my @files = ( );

# Open the directory

unless(opendir(DIRECTORY, $directory)) {

print "Cannot open directory $directory!\n";

exit;

}

# Read the directory, ignoring special entries "." and ".."


@files = grep (!/^\.\.?$/, readdir(DIRECTORY));

closedir(DIRECTORY);

# If file, print its name

# If directory, recursively print its contents

# Notice that we need to prepend the directory name!

foreach my $file (@files) {

# If the directory entry is a regular file

if (-f "$directory/$file") {

print "$directory/$file\n";

# If the directory entry is a subdirectory

}elsif( -d "$directory/$file") {
# Here is the recursive call to this subroutine

list_recursively("$directory/$file");

}

}

}


Here's the output of Example 11-3 (notice that it's the same as the output of Example 11-2):

pdb/3c/pdb43c9.ent

pdb/3c/pdb43ca.ent

pdb/44/pdb144d.ent

pdb/44/pdb144l.ent

pdb/44/pdb244d.ent

pdb/44/pdb244l.ent

pdb/44/pdb344d.ent

pdb/44/pdb444d.ent

pdb/pdb1a4o.ent

Look over the code for Example 11-3 and compare it to Example 11-2. As you can see, the programs are largely identical. Example 11-2 is all one main program; Example 11-3 has almost identical code but has packaged it up as a subroutine that is called by a short main program. The main program of Example 11-3 simply calls a recursive function, giving it a directory name (for a directory that exists on my computer; you may need to change the directory name when you attempt to run this program on your own computer). Here is the call:

list_recursively('pdb');

I don't know if you feel let down, but I do. This looks just like any other subroutine call. Clearly, the recursion must be defined within the subroutine. It's not until the very end of the list_recursively subroutine, where the program finds (using the -d file test operator) that one of the contents of the directory that it's listing is itself a directory, that there's a significant difference in the code as compared with Example 11-2. At that point, Example 11-2 has code to once again look for regular files or for directories. But this subroutine in Example 11-3 simply calls a subroutine, which happens to be itself, namely, list_recursively:

list_recursively("$directory/$file");

That's recursion.

As you've seen here, there are times when the data—for instance, the hierarchical structure of a filesystem—is well matched by the capabilities of recursive programs. The fact that the recursive call happens at the end of the subroutine means that it's a special type of recursion called tail recursion. Although recursion can be slow, due to all the subroutine calls it can create, the good news about tail recursion is that many compilers can optimize the code to make it run much faster. Using recursion can result in clean, short, easy-to-understand programs. (Although Perl doesn't yet optimize it, current plans for Perl 6 include support for optimizing tail recursion.)


11.2.3 Processing Many Files

Perl has modules for a variety of tasks. Some come standard with Perl; more can be installed after obtaining them from CPAN or elsewhere: http://www.CPAN.org/.



Example 11-3 in the previous section showed how to locate all files and directories under a given directory. There's a module that is standard in any recent version of Perl called File::Find. You can find it in your manual pages: on Unix or Linux, for instance, you issue the command perldoc File::Find. This module makes it easy—and efficient—to process all files under a given directory, performing whatever operations you specify.

Example 11-4 uses File::Find. Consult the documentation for more examples of this useful module. The example shows the same functionality as Example 11-3 but now uses File::Find. It simply lists the files and directories. Notice how much less code you have to write if you find a good module, ready to use!



1   ...   18   19   20   21   22   23   24   25   ...   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page