Beginning Perl for Bioinformatics


Example 11-4. Demonstrate File::Find



Download 1.4 Mb.
Page23/28
Date conversion29.03.2017
Size1.4 Mb.
1   ...   20   21   22   23   24   25   26   27   28

Example 11-4. Demonstrate File::Find

#!/usr/bin/perl

# Demonstrate File::Find
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
use File::Find;
find ( \&my_sub, ('pdb') );
sub my_sub {

-f and (print $File::Find::name, "\n");

}

exit;

Notice that a reference is passed to the my_sub subroutine by prefacing it with the backslash character. You also need to preface the name with the ampersand character, as mentioned in Chapter 6.

The call to find can also be done like this:

find sub { -f and (print $File::Find::name, "\n") }, ('pdb');

This puts an anonymous subroutine in place of the reference to the my_sub subroutine, and it's a convenience for these types of short subroutines.

Here's the output:

pdb/pdb1a4o.ent

pdb/44/pdb144d.ent

pdb/44/pdb144l.ent

pdb/44/pdb244d.ent

pdb/44/pdb244l.ent

pdb/44/pdb344d.ent

pdb/44/pdb444d.ent

pdb/3c/pdb43c9.ent

pdb/3c/pdb43ca.ent

As a final example of processing files with Perl, here's the same functionality as the preceding programs, with a one-line program, issued at the command line:

perl -e 'use File::Find;find sub{-f and (print $File::Find::name,"\n")},("pdb")'

Pretty cool, for those who admire terseness, although it doesn't really eschew obfuscation. Also note that for those on Unix systems, ls -R pdb and find pdb -print do the same thing with even less typing.

The reason for using a subroutine that you define is that it enables you to perform any arbitrary tests on the files you find and then take any actions with those files. It's another case of modularization: the File::Find module makes it easy to recurse over all the files and directories in a file structure and lets you do as you wish with the files and directories you find.

11.3 PDB Files

Here's a section of an actual PDB file:

HEADER SUGAR BINDING PROTEIN 03-MAR-99 1C1F

TITLE LIGAND-FREE CONGERIN I

COMPND MOL_ID: 1;

COMPND 2 MOLECULE: CONGERIN I;

COMPND 3 CHAIN: A;

COMPND 4 FRAGMENT: CARBOHYDRATE-RECOGNITION-DOMAIN;

COMPND 5 BIOLOGICAL_UNIT: HOMODIMER

SOURCE MOL_ID: 1;

SOURCE 2 ORGANISM_SCIENTIFIC: CONGER MYRIASTER;

SOURCE 3 ORGANISM_COMMON: CONGER EEL;

SOURCE 4 TISSUE: SKIN MUCUS;

SOURCE 5 SECRETION: NON-CLASSICAL

KEYWDS GALECTIN, LECTIN, BETA-GALACTOSE-BINDING, SUGAR BINDING

KEYWDS 2 PROTEIN

EXPDTA X-RAY DIFFRACTION

AUTHOR T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,T.YAMANE,

AUTHOR 2 H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO

REVDAT 2 14-OCT-99 1C1F 1 SEQADV HEADER

REVDAT 1 08-OCT-99 1C1F 0

JRNL AUTH T.SHIRAI,C.MITSUYAMA,Y.NIWA,Y.MATSUI,H.HOTTA,

JRNL AUTH 2 T.YAMANE,H.KAMIYA,C.ISHII,T.OGAWA,K.MURAMOTO

JRNL TITL HIGH-RESOLUTION STRUCTURE OF CONGER EEL GALECTIN,

JRNL TITL 2 CONGERIN I, IN LACTOSE- LIGANDED AND LIGAND-FREE

JRNL TITL 3 FORMS: EMERGENCE OF A NEW STRUCTURE CLASS BY

JRNL TITL 4 ACCELERATED EVOLUTION

JRNL REF STRUCTURE (LONDON) V. 7 1223 1999

JRNL REFN ASTM STRUE6 UK ISSN 0969-2126 2005

REMARK 1

REMARK 2

REMARK 2 RESOLUTION. 1.6 ANGSTROMS.

REMARK 3

REMARK 3 REFINEMENT.

REMARK 3 PROGRAM : X-PLOR 3.1

REMARK 3 AUTHORS : BRUNGER

REMARK 3

REMARK 3 DATA USED IN REFINEMENT.

REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.60

REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 8.00

REMARK 3 DATA CUTOFF (SIGMA(F)) : 3.000

REMARK 3 DATA CUTOFF HIGH (ABS(F)) : NULL

REMARK 3 DATA CUTOFF LOW (ABS(F)) : NULL

REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 85.0

REMARK 3 NUMBER OF REFLECTIONS : 17099

REMARK 3

REMARK 3

REMARK 3 FIT TO DATA USED IN REFINEMENT.

REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT

REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM

REMARK 3 R VALUE (WORKING SET) : 0.201

REMARK 3 FREE R VALUE : 0.247

REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.000

REMARK 3 FREE R VALUE TEST SET COUNT : 855

REMARK 3 ESTIMATED ERROR OF FREE R VALUE : NULL

REMARK 3

...


(file truncated here)

REMARK 4 REMARK 4 1C1F COMPLIES WITH FORMAT V. 2.3, 09-JULY-1998 REMARK 7 REMARK 7 >>> WARNING: CHECK REMARK 999 CAREFULLY REMARK 8 REMARK 8 SIDE-CHAINS OF SER123 AND LEU124 ARE MODELED AS ALTERNATIVE REMARK 8 CONFORMERS. REMARK 9 REMARK 9 SER1 IS ACETYLATED. REMARK 10 REMARK 10 TER REMARK 10 SER: THE N-TERMINAL RESIDUE WAS NOT OBSERVED REMARK 100 REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY RCSB ON 07-MAR-1999. REMARK 100 THE RCSB ID CODE IS RCSB000566. REMARK 200 REMARK 200 EXPERIMENTAL DETAILS REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION REMARK 200 DATE OF DATA COLLECTION : NULL REMARK 200 TEMPERATURE (KELVIN) : 291.0 REMARK 200 PH : 9.00 REMARK 200 NUMBER OF CRYSTALS USED : 1 REMARK 200 REMARK 200 SYNCHROTRON (Y/N) : Y REMARK 200 RADIATION SOURCE : PHOTON FACTORY REMARK 200 BEAMLINE : BL6A REMARK 200 X-RAY GENERATOR MODEL : NULL REMARK 200 MONOCHROMATIC OR LAUE (M/L) : M REMARK 200 WAVELENGTH OR RANGE (A) : 1.00 REMARK 200 MONOCHROMATOR : NULL REMARK 200 OPTICS : NULL REMARK 200 ...


(file truncated here)

REMARK 500 REMARK 500 GEOMETRY AND STEREOCHEMISTRY REMARK 500 SUBTOPIC: COVALENT BOND ANGLES REMARK 500 REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE REMARK 500 THAN 4*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE). REMARK 500 REMARK 500 STANDARD TABLE: REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1) REMARK 500 REMARK 500 EXPECTED VALUES: ENGH AND HUBER, 1991 REMARK 500 REMARK 500 M RES CSSEQI ATM1 ATM2 ATM3 REMARK 500 HIS A 44 N - CA - C ANGL. DEV. =-10.3 DEGREES REMARK 500 LEU A 132 CA - CB - CG ANGL. DEV. = 12.5 DEGREES REMARK 700 REMARK 700 SHEET REMARK 700 DETERMINATION METHOD: AUTHOR-DETERMINED REMARK 999 REMARK 999 SEQUENCE REMARK 999 LEU A 135 IS NOT PRESENT IN SEQUENCE DATABASE REMARK 999 DBREF 1C1F A 1 136 SWS P26788 LEG_CONMY 1 135 SEQADV 1C1F LEU A 135 SWS P26788 SEE REMARK 999 SEQRES 1 A 136 SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL SEQRES 2 A 136 GLY LYS PHE LEU THR VAL GLY GLY PHE ILE ASN ASN SER SEQRES 3 A 136 PRO GLN ARG PHE SER VAL ASN VAL GLY GLU SER MET ASN SEQRES 4 A 136 SER LEU SER LEU HIS LEU ASP HIS ARG PHE ASN TYR GLY SEQRES 5 A 136 ALA ASP GLN ASN THR ILE VAL MET ASN SER THR LEU LYS SEQRES 6 A 136 GLY ASP ASN GLY TRP GLU THR GLU GLN ARG SER THR ASN SEQRES 7 A 136 PHE THR LEU SER ALA GLY GLN TYR PHE GLU ILE THR LEU SEQRES 8 A 136 SER TYR ASP ILE ASN LYS PHE TYR ILE ASP ILE LEU ASP SEQRES 9 A 136 GLY PRO ASN LEU GLU PHE PRO ASN ARG TYR SER LYS GLU SEQRES 10 A 136 PHE LEU PRO PHE LEU SER LEU ALA GLY ASP ALA ARG LEU SEQRES 11 A 136 THR LEU VAL LYS LEU GLU FORMUL 2 HOH *81(H2 O1) HELIX 1 1 GLY A 66 ASN A 68 5 3 SHEET 1 S1 1 GLY A 3 VAL A 6 0 SHEET 1 S2 1 PHE A 121 GLY A 126 0 SHEET 1 S3 1 ARG A 29 GLY A 35 0 SHEET 1 S4 1 LEU A 41 ASN A 50 0 SHEET 1 S5 1 GLN A 55 THR A 63 0 SHEET 1 S6 1 GLN A 74 SER A 76 0 SHEET 1 F1 1 ALA A 128 GLU A 136 0 SHEET 1 F2 1 PHE A 16 ILE A 23 0 SHEET 1 F3 1 TYR A 86 TYR A 93 0 SHEET 1 F4 1 LYS A 97 ILE A 102 0 SHEET 1 F5 1 ASN A 107 PRO A 111 0 CRYST1 94.340 36.920 40.540 90.00 90.00 90.00 P 21 21 2 4 ORIGX1 1.000000 0.000000 0.000000 0.00000 ORIGX2 0.000000 1.000000 0.000000 0.00000 ORIGX3 0.000000 0.000000 1.000000 0.00000 SCALE1 0.010600 0.000000 0.000000 0.00000 SCALE2 0.000000 0.027085 0.000000 0.00000 SCALE3 0.000000 0.000000 0.024667 0.00000 ATOM 1 N GLY A 2 1.888 -8.251 -2.511 1.00 36.63 N ATOM 2 CA GLY A 2 2.571 -8.428 -1.248 1.00 33.02 C ATOM 3 C GLY A 2 2.586 -7.069 -0.589 1.00 30.43 C ATOM 4 O GLY A 2 2.833 -6.107 -1.311 1.00 33.27 O ATOM 5 N GLY A 3 2.302 -6.984 0.693 1.00 24.67 N ATOM 6 CA GLY A 3 2.176 -5.723 1.348 1.00 18.88 C ATOM 7 C GLY A 3 0.700 -5.426 1.526 1.00 16.58 C ATOM 8 O GLY A 3 -0.187 -6.142 1.010 1.00 12.47 O ATOM 9 N LEU A 4 0.494 -4.400 2.328 1.00 15.00 N ...


(file truncated here)

ATOM 1078 CG GLU A 136 -0.873 9.368 16.046 1.00 38.96 C ATOM 1079 CD GLU A 136 -0.399 9.054 17.456 1.00 44.66 C ATOM 1080 OE1 GLU A 136 0.789 8.749 17.641 1.00 47.97 O ATOM 1081 OE2 GLU A 136 -1.236 9.099 18.361 1.00 47.75 O ATOM 1082 OXT GLU A 136 0.764 12.146 12.712 1.00 26.22 O TER 1083 GLU A 136 HETATM 1084 O HOH 200 -1.905 -7.624 2.822 1.00 14.50 O HETATM 1085 O HOH 201 -8.374 7.981 9.202 1.00 20.77 O HETATM 1086 O HOH 202 -4.047 9.199 11.632 1.00 38.24 O HETATM 1087 O HOH 203 6.172 14.210 8.483 1.00 14.50 O HETATM 1088 O HOH 204 2.903 7.804 15.329 1.00 24.51 O HETATM 1089 O HOH 205 16.654 0.676 11.968 1.00 10.49 O ...


(file truncated here)

HETATM 1157 O HOH 286 6.960 14.840 -3.025 1.00 35.59 O HETATM 1158 O HOH 287 -3.222 10.410 7.061 1.00 38.91 O HETATM 1159 O HOH 288 28.306 0.551 4.876 1.00 52.13 O HETATM 1160 O HOH 290 21.506 -12.424 9.751 1.00 31.68 O HETATM 1161 O HOH 291 12.951 10.424 -7.324 1.00 46.10 O HETATM 1162 O HOH 292 18.119 -15.184 14.793 1.00 56.82 O HETATM 1163 O HOH 293 13.501 22.220 8.216 1.00 43.30 O HETATM 1164 O HOH 294 13.916 -11.387 9.695 1.00 47.13 O MASTER 240 0 0 1 11 0 0 6 1163 1 0 11 END

PDB files are long, mostly due to the need for information about each atom in the molecule; this relatively short one, when complete, is extensive—28 formatted pages. I cut it here to a little over three pages, showing just enough of the principal sections to give you the overall idea.

The PDB web site has the basic documents you need to read and program with PDB files. The Protein Data Bank Contents Guide (http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html) is the best reference, and there are also FAQs and additional documents available.

In the following sections, you'll extract information from these files. Since the information in these files describes the 3D structure of macromolecules, the files are frequently used by graphical programs that display a spatial representation of the molecules. The scope of this book does not include graphics; however, you will see how to get spatial coordinates out of the files. The largest part of PDB files are the ATOM record type lines containing the coordinates of the atoms. Because of this level of detail, PDB files are typically longer than GenBank records. (Note the inconsistent terminology—a unit of PDB is the file, which contains one structure; a unit of GenBank is the record, which contains one entry.)


11.3.1 PDB File Format

Let's take a look at a PDB file and the documentation that tells how the information is formatted in a PDB file. Based on that information, you'll parse the file to extract information of interest.

PDB files are composed of lines of 80 columns that begin with one of several predefined record names and end with a newline. ("Column" means position on a line: the first character is in the first column, and so forth.) Blank columns are padded with spaces. A record type is one or more lines with the same record name. Different record types have different types of fields defined within the lines. They are also grouped according to function.

The SEQRES record type is one of four record types in the Primary Structure Section, which presents the primary structure of the peptide or nucleotide sequence:


DBREF

Reference to the entry in the sequence database(s)



SEQADV

Identification of conflicts between PDB and the named sequence database



SEQRES

Primary sequence of backbone residues


MODRES

Identification of modifications to standard residues

The DBREF and SEQADV record types in the example PDB entry from the previous section give reference information and details on conflicts between the PDB and the original database. (The example doesn't include a MODRES record type.) Here are those record types from the entry:

DBREF 1C1F A 1 136 SWS P26788 LEG_CONMY 1 135

SEQADV 1C1F LEU A 135 SWS P26788 SEE REMARK 999

Briefly, the DBREF line states there's a PDB file called 1C1F (from a file named pdb1c1f.ent), the residues in chain A are numbered from 1 to 136 in the original Swiss-Prot (SWS) database, the ID number P26788 and the name LEG_CONMY are assigned in that database (in many databases these are identical), and the residues are numbered 1 to 135 in PDB. The discrepancy in the numbering between the original database and PDB is explained in the SEQADV record type, which refers you to a REMARK 999 line (not shown here) where you discover that the PDB entry disagrees with the Swiss-Prot sequence concerning a leucine at position 135 (perhaps two different groups determined the structure, and they disagree at this point).[2]


[2] The cross-referencing to different databases is problematic in older PDB files: it may be missing, or buried somewhere in a REMARK 999 line.

You can see that to parse the information in those two lines by a program requires several steps, such as following links to other lines in the PDB entry that further explain discrepancies and identifying other databases.

Links between databases are important in bioinformatics. Table 11-1 displays the databases that are referred to in PDB files. As you already know, there are many biological databases; those shown here have a good deal of protein or structural data.
Table 11-1. Databases referenced in PDB files


Database

PDB code

BioMagResBank

BMRB

BLOCKS

BLOCKS

European Molecular Biology Laboratory

EMBL

GenBank

GB

Genome Data Base

GDB

Nucleic Acid Database

NDB

PROSITE

PROSIT

Protein Data Bank

PDB

Protein Identification Resource

PIR

SWISS-PROT


SWS

TREMBL

TREMBL

11.3.2 SEQRES

For starters, let's try a fairly easy task in Perl: extracting the amino acid sequence data. To extract the amino acid primary sequence information, you need to parse the record type SEQRES. Here is a SEQRES line from the PDB file listed earlier:

SEQRES 1 A 136 SER GLY GLY LEU GLN VAL LYS ASN PHE ASP PHE THR VAL

The following code shows the SEQRES record type as defined in the Protein Data Bank Contents Guide. This section on SEQRES, which is a fairly simple record type, is shown in its entirely to help familiarize you with this kind of documentation.


SEQRES

Overview


SEQRES records contain the amino acid or nucleic acid sequence of residues in

each chain of the

macromolecule that was studied.
Record Format
COLUMNS DATA TYPE FIELD DEFINITION

---------------------------------------------------------------------------------

1 - 6 Record name "SEQRES"
9 - 10 Integer serNum Serial number of the SEQRES record

for the current chain. Starts at 1

and increments by one each line.

Reset to 1 for each chain.


12 Character chainID Chain identifier. This may be any

single legal character, including a

blank which is used if there is

only one chain.


14 - 17 Integer numRes Number of residues in the chain.

This value is repeated on every

record.

20 - 22 Residue name resName Residue name.

24 - 26 Residue name resName Residue name.
28 - 30 Residue name resName Residue name.
32 - 34 Residue name resName Residue name.
36 - 38 Residue name resName Residue name.
40 - 42 Residue name resName Residue name.
44 - 46 Residue name resName Residue name.
48 - 50 Residue name resName Residue name.
52 - 54 Residue name resName Residue name.
56 - 58 Residue name resName Residue name.
60 - 62 Residue name resName Residue name.
64 - 66 Residue name resName Residue name.
68 - 70 Residue name resName Residue name.
Details
* PDB entries use the three-letter abbreviation for amino acid names and the

one-letter code for nucleic acids.


* In the case of non-standard groups, a hetID of up to three (3) alphanumeric

characters is used. Common HET names appear in the HET dictionary.


* Each covalently contiguous sequence of residues (connected via the "backbone"

atoms) is represented as an individual chain.


* Heterogens which are integrated into the backbone of the chain are listed as

being part of the chain and are included in the SEQRES records for that chain.


* Each set of SEQRES records and each HET group is assigned a component number.

The component number is assigned serially beginning with 1 for the first set

of SEQRES records. This number is given explicitly in the FORMUL record, but

only implicitly in the SEQRES record.

* The SEQRES records must list residues present in the molecule studied, even

if the coordinates are not present.

* C- and N-terminus residues for which no coordinates are provided due to

disorder must be listed on SEQRES.


* All occurrences of standard amino or nucleic acid residues (ATOM records)

must be listed on a SEQRES record. This implies that a numRes of 1 is valid.


* No distinction is made between ribo- and deoxyribonucleotides in the SEQRES

records. These residues are identified with the same residue name (i.e., A,

C, G, T, U, I).
* If the entire residue sequence is unknown, the serNum in column 10 is "0",

the number of residues thought to comprise the molecule is entered as numRes

in columns 14 - 17, and resName in columns 20 - 22 is "UNK".
* In case of microheterogeneity, only one of the sequences is presented. A

REMARK is generated to explain this and a SEQADV is also generated.


Verification/Validation/Value Authority Control
The residues presented on the SEQRES records must agree with those found in

the ATOM records.


The SEQRES records are checked by PDB using the sequence databases and

information provided by the depositor.


SEQRES is compared to the ATOM records during processing, and both are checked

against the sequence database. All discrepancies are either resolved or

annotated in the entry.
Relationships to Other Record Types
The residues presented on the SEQRES records must agree with those found in

the ATOM records. DBREF refers to the corresponding entry in the sequence

databases. SEQADV lists all discrepancies between the entry's sequence for

which there are coordinates and that referenced in the sequence database.

MODRES describes modifications to a standard residue.
Example

1 2 3 4 5 6 7

1234567890123456789012345678901234567890123456789012345678901234567890

SEQRES 1 A 21 GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU

SEQRES 2 A 21 TYR GLN LEU GLU ASN TYR CYS ASN

SEQRES 1 B 30 PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU

SEQRES 2 B 30 ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR

SEQRES 3 B 30 THR PRO LYS ALA

SEQRES 1 C 21 GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU

SEQRES 2 C 21 TYR GLN LEU GLU ASN TYR CYS ASN

SEQRES 1 D 30 PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU

SEQRES 2 D 30 ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR

SEQRES 3 D 30 THR PRO LYS ALA

Known Problems
Polysaccharides do not lend themselves to being represented in SEQRES.
There is no mechanism provided to describe sequence runs when the exact

ordering of the sequence is not known.


For cyclic peptides, PDB arbitrarily assigns a residue as the N-terminus.
For microheterogeneity only one of the possible residues in a given position

is provided in SEQRES.


No distinction is made between ribo- and deoxyribonucleotides in the SEQRES

records. These residues are identified with the same residue name (i.e., A,

C, G, T, U).

The structure of the line containing the SEQRES record type is fairly straightforward, with fields assigned to specific locations or columns in the line. You'll see later how to use these locations to parse the information. Note that the documentation includes many details that arise when handling such complex experimental data.

Apart from the fairly standard problem of accumulating the sequence, there is the added complication of multiple strands. By reading the documentation just shown, you'll see that the SEQRES identifier is followed by a number representing the line number for that chain, and the chain is given in the next field (although in older records it was optional and may be blank). Following those fields comes a number that gives the total number of residues in the chain. Finally, after that, come residues represented as three-letter codes. What is needed, and what can be ignored to meet our programming goals?

11.4 Parsing PDB Files

First, Example 11-5 shows the main program and three subroutines that will be discussed in this section.


Example 11-5. Extract sequence chains from PDB file

#!/usr/bin/perl

# Extract sequence chains from PDB file

use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Read in PDB file: Warning - some files are very large!

my @file = get_file_data('pdb/c1/pdb1c1f.ent');


# Parse the record types of the PDB file

my %recordtypes = parsePDBrecordtypes(@file);


# Extract the amino acid sequences of all chains in the protein

my @chains = extractSEQRES( $recordtypes{'SEQRES'} );


# Translate the 3-character codes to 1-character codes, and print

foreach my $chain (@chains) {

print "****chain $chain **** \n";

print "$chain\n";

print iub3to1($chain), "\n";

}
exit;


################################################################################

# Subroutines for Example 11-5

################################################################################
# parsePDBrecordtypes

#

#--given an array of a PDB file, return a hash with



# keys = record type names

# values = scalar containing lines for that record type


sub parsePDBrecordtypes {
my @file = @_;
use strict;

use warnings;

my %recordtypes = ( );

foreach my $line (@file) {

# Get the record type name which begins at the

# start of the line and ends at the first space

my($recordtype) = ($line =~ /^(\S+)/);

# .= fails if a key is undefined, so we have to

# test for definition and use either .= or = depending

if(defined $recordtypes{$recordtype} ) {

$recordtypes{$recordtype} .= $line;

}else{


$recordtypes{$recordtype} = $line;

}

}



return %recordtypes;

}

# extractSEQRES

#

#--given an scalar containing SEQRES lines,


# return an array containing the chains of the sequence
sub extractSEQRES {
use strict;

use warnings;


my($seqres) = @_;
my $lastchain = '';

my $sequence = '';

my @results = ( );
# make array of lines

my @record = split ( /\n/, $seqres);

foreach my $line (@record) {

# Chain is in column 12, residues start in column 20

my ($thischain) = substr($line, 11, 1);

my($residues) = substr($line, 19, 52); # add space at end

# Check if a new chain, or continuation of previous chain

if("$lastchain" eq "") {

$sequence = $residues;

}elsif("$thischain" eq "$lastchain") {

$sequence .= $residues;

# Finish gathering previous chain (unless first record)

}elsif ( $sequence ) {

push(@results, $sequence);

$sequence = $residues;

}

$lastchain = $thischain;



}
# save last chain

push(@results, $sequence);

return @results;

}
# iub3to1

#

#--change string of 3-character IUB amino acid codes (whitespace separated)



# into a string of 1-character amino acid codes
sub iub3to1 {
my($input) = @_;

my %three2one = (

'ALA' => 'A',

'VAL' => 'V',

'LEU' => 'L',

'ILE' => 'I',

'PRO' => 'P',

'TRP' => 'W',

'PHE' => 'F',

'MET' => 'M',

'GLY' => 'G',

'SER' => 'S',

'THR' => 'T',

'TYR' => 'Y',

'CYS' => 'C',

'ASN' => 'N',

'GLN' => 'Q',

'LYS' => 'K',

'ARG' => 'R',

'HIS' => 'H',

'ASP' => 'D',

'GLU' => 'E',

);

# clean up the input

$input =~ s/\n/ /g;

my $seq = '';

# This use of split separates on any contiguous whitespace

my @code3 = split(' ', $input);
foreach my $code (@code3) {

# A little error checking

if(not defined $three2one{$code}) {

print "Code $code not defined\n";

next;

}

$seq .= $three2one{$code};



}

return $seq;

}

It's important to note that the main program, which calls the subroutine get_file_data to read in the PDB file, has included a warning about the potentially large size of any given PDB file. (For instance, the PDB file 1gav weighs in at 3.45 MB.) Plus, the main program follows the reading in of the entire file, with the subroutine parsePDBrecordtypes that makes copies of all lines in the input file, separated by record type. At this point, the running program is using twice the amount of memory as the size of the file. This design has the advantage of clarity and modularity, but it can cause problems if main memory is in short supply. The use of memory can be lessened by not saving the results of reading in the file, but instead passing the file data directly to the parsePDBrecordtypes subroutine, like so:


# Get the file data and parse the record types of the PDB file

%recordtypes = parsePDBrecordtypes(get_file_data('pdb/c1/pdb1c1f.ent'));

Further savings of memory are possible. For instance, you can rewrite the program to just read the file one line at a time while parsing the data into the record types. I point out these considerations to give you an idea of the kinds of choices that are practically important in processing large files. However, let's stick with this design for now. It may be expensive in terms of memory, but it's very clear in terms of overall program structure.

In Chapter 10, I demonstrated two ways to parse GenBank files into sequence and annotation and then how to parse the annotation into finer and finer levels of detail.

The first method involved iterating through an array of the lines of the record. Recall that due to the structure of multiline fields, it was necessary to set flags while iterating to keep track of which field the input line was in.[3]

[3] In GenBank, the multiline information sets were called fields; in PDB, they're called record types. Just as in biology different researchers may use their own terminology for some structures or concepts, so too in computer science there can be a certain creativity in terminology. This is one of the interesting difficulties in integrating biological data sources.

The other method, which worked better for GenBank files, involved regular expressions. Which method will work best for PDB files? (Or should you settle on a third approach?)

There are several ways to extract this information. PDB makes it easy to collect record types, since they all start with the same keyword at the beginning of the line. The technique in the last chapter that used regular expressions parsed the top-level fields of the file; this would be somewhat unwieldy for PDB files. (See the exercises at the end of the chapter.) For instance, a regular expression such as the following matches all adjacent SEQRES lines into a scalar string:

$record =~ /SEQRES.*\n(SEQRES.*\n)*/;

$seqres = $&;

The regular expression matches a single SEQRES line with SEQRES.*\n and then matches zero or more additional lines with (SEQRES.*\n)*. Notice how the final * indicates zero or more of the preceding item, namely, the parenthesized expression (SEQRES.*\n). Also note that the .* matches zero or more nonnewline characters. Finally, the second line captures the pattern matched, denoted by $&, into the variable $seqres.

To extend this to capture all record types, see the exercises at the end of the chapter.

For PDB files, each line starts with a keyword that explicitly states to which record type that line belongs. You will find in the documentation that each record type has all its lines adjacent to each other in a group. In this case, it seems that simply iterating through the lines and collecting the record types would be the simplest programming approach.


Example 11-5 contains a subroutine parsePDBrecordtypes that parses the PDB record types from an array containing the lines of the PDB record. This is a short, clean subroutine that accomplishes what is needed. The comments describe what's happening pretty well, which, as you know, is a critical factor in writing good code. Basically, each line is examined for its record type and is then added to the value of a hash entry with the record type as the key. The hash is returned from the subroutine.



1   ...   20   21   22   23   24   25   26   27   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page