Beginning Perl for Bioinformatics

Download 1.4 Mb.
Date conversion29.03.2017
Size1.4 Mb.
1   ...   13   14   15   16   17   18   19   20   ...   28

8.5.1 FASTA Format

Let's write a subroutine that can handle FASTA-style data. This is useful in its own right and as a warm-up for the upcoming chapters on GenBank, PDB, and BLAST.

FASTA format is basically just lines of sequence data with newlines at the end so it can be printed on a page or displayed on a computer screen. The length of the lines isn't specified, but for compatibility, it's best to limit them to 80 characters in length. There is also header information, a line or lines at the beginning of the file that start with the greater-than > character, that can contain any text whatsoever (or no text). Typically, a header line contains the name of the DNA or the gene it comes from, often separated by a vertical bar from additional information about the sequence, the experiment that produced it, or other, nonsequence information of that nature.

Much FASTA-aware software insists that there must be only one header line; others permit several lines. Our subroutine will accept either one or several header lines plus comments beginning with #.

The following is a FASTA file. We'll call it sample.dna and use it in several programs. You should copy it, download it from this book's web site, or make up your own file with your own data.

> sample dna | (This is a typical fasta header.)























8.5.2 A Design to Read FASTA Files

In Chapter 4, you learned how to read in sequence data; here, you just have to extend that method to deal with the header lines. You'll also learn how to discard empty lines and lines that begin with the pound sign #, i.e., comments in Perl and other languages and file formats. (These don't appear in the FASTA file sample.dna just shown.)

There are two choices when reading in the data. You can read from the open file one line at a time, making decisions as you go. Or, you can slurp the whole file into an array and then operate on the array. For very big files, it's sometimes best to read them one line at a time, especially if you're looking for some small bit of information. (This is because reading a large file into an array uses a large amount of memory. If your system isn't robust enough, it may crash.)

For smaller, normal-sized files, the advantage to reading all the data into an array is that you can then easily look through at the data and do operations on it. That's what we'll do with our subroutine, but remember, this approach can cause memory space problems with larger files, and there are other ways of proceeding.

Let's write a subroutine that, given as an argument a filename containing FASTA-formatted data, returns the sequence data.

Before doing so you should think about whether you should have just one subroutine, or perhaps one subroutine that opens and reads a file, called by another subroutine that extracts the sequence data. Let's use two subroutines, keeping in mind that you can reuse the subroutine that deals with arbitrary files every time you need to write such a program for other formats.

Let's start with some pseudocode:

subroutine get data from a file

argument = filename
open file

if can't open, print error message and exit

read in data and
return @data

Subroutine extract sequence data from fasta file

argument = array of file data in fasta format
Discard all header lines

(and blank and comment lines for good measure)

If first character of first line is >, discard it
Read in the rest of the file, join in a scalar,

edit out nonsequence data

return sequence


In the first subroutine that gets data from a file, there's a question as to what's the best thing to do when the file can't be read. Here, we're taking the drastic approach: yelling "Fire!" and exiting. But you wouldn't necessarily want your program to just stop whenever it can't open a file. Maybe you're asking for filenames from the user at the keyboard or on a web page, and you'd like to give them three chances to type in the filename correctly. Or maybe, if the file can't be opened, you want a default file instead.

Maybe you can return a false value, such as an empty array, if you can't open the file. Then a program that calls this subroutine can exit, try again, or whatever it wants. But what if you successfully open the file, but it was absolutely empty? Then you'd have succeeded and returned an empty array, and the program calling this subroutine would think incorrectly, that the file couldn't be opened. So, that wouldn't work.

There are other options, such as returning the special "undefined" value. Let's keep what we've got, but it's important to remember that handling errors can be an important, and sometimes tricky, part of writing robust code, code that responds well in unusual circumstances.

The second subroutine takes the array of FASTA-formatted sequence and returns just the unformatted sequence in a string.

8.5.3 A Subroutine to Read FASTA Files

Now that you've thought about the problem, written some pseudocode, considered alternate ways of designing the subroutines and the costs and benefits of the choices, you're ready to code:

# get_file_data


# A subroutine to get data from a file given its filename

sub get_file_data {
my($filename) = @_;
use strict;

use warnings;

# Initialize variables

my @filedata = ( );

unless( open(GET_FILE_DATA, $filename) ) {

print STDERR "Cannot open file \"$filename\"\n\n";


@filedata = ;

return @filedata;

# extract_sequence_from_fasta_data


# A subroutine to extract FASTA sequence data from an array

sub extract_sequence_from_fasta_data {
my(@fasta_file_data) = @_;
use strict;

use warnings;

# Declare and initialize variables

my $sequence = '';

foreach my $line (@fasta_file_data) {
# discard blank line

if ($line =~ /^\s*$/) {

# discard comment line

} elsif($line =~ /^\s*#/) {

# discard fasta header line

} elsif($line =~ /^>/) {

# keep line, add to sequence string

} else {

$sequence .= $line;



# remove non-sequence data (in this case, whitespace) from $sequence string

$sequence =~ s/\s//g;

return $sequence;


Notice that nowhere in the code for extract_sequence_from_fasta_data do you check to see what's in the file: is it really DNA or protein sequence data in FASTA format? Of course, you can write a subroutine—call it is_fasta—that checks the data to see if it's what we expect. But I'll leave that for the exercises.

A few comments about the extract_sequence_from_fasta_data subroutine should be made. The following line includes a variable declaration as it is used in a loop:

foreach my $line (@fasta_file_data) {

You've seen this in for loops as well. It's convenient to declare these my variables as $line on the spot, as they tend to have common names and aren't used outside the loop.

Some of the regular expressions deserve brief comment. In this line:

if ($line =~ /^\s*$/) {

the \s matches whitespace, that is, space, tab, formfeed, carriage return, or newline. \s* matches any amount of whitespace (even none). The ^ matches the beginning of the line, and the $ matches the end of the line. So altogether, this regular expression matches blank lines with nothing or only whitespace in them.

This regular expression also has nothing or only whitespace at the beginning of the line, up to a pound sign:

} elsif($line =~ /^\s*#/) {

This expression matches a greater-than sign at the beginning of the line:

} elsif($line =~ /^>/) {

Finally, the following statement removes whitespace, including newlines:

$sequence =~ s/\s//g;

We've placed these two new subroutines into our module. Now let's write a main program for these subroutines and look at the output. First, there's one more subroutine to write that handles the printing of long sequences.

8.5.4 Writing Formatted Sequence Data

When you try to print the "raw" sequence data, it can be a problem if the data is much longer than the width of the page. For most practical purposes, 80 characters is about the maximum length you should try to fit across a page. Let's write a print_sequence subroutine that takes as its arguments some sequence and a line length and prints out the sequence, breaking it up into lines of that length. It will have a strong similarity to the dna2peptide subroutine. Here it is:

# print_sequence


# A subroutine to format and print sequence data

sub print_sequence {
my($sequence, $length) = @_;
use strict;

use warnings;

# Print sequence in lines of $length

for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {

print substr($sequence, $pos, $length), "\n";



The code depends on the behavior of substr, which gives the partial substring at the end of the string, even if it's less than the requested length. You can see there's a new print_sequence subroutine in the module (see Chapter 6). We remembered to keep the statement 1; as the last line of the module. Example 8-2 shows the main program.

Example 8-2. Read a FASTA file and extract the sequence data


# Read a fasta file and extract the sequence data
use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Declare and initialize variables

my @file_data = ( );

my $dna = '';
# Read in the contents of the file "sample.dna"

@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"

$dna = extract_sequence_from_fasta_data(@file_data);

# Print the sequence in lines 25 characters long

print_sequence($dna, 25);


Here's the output of Example 8-2:













































8.5.5 A Main Program for Reading DNA and Writing Protein

Now, one final program for this section. Let's add to the preceding program a translation from DNA to protein and print out the protein instead. Notice how short Example 8-3 is! As you accumulate useful subroutines in our modules, programs get easier and easier to write.

Example 8-3. Read a DNA FASTA file, translate to protein, and format output


# Read a fasta file and extract the DNA sequence data

# Translate it to protein and print it out in 25-character-long lines

use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Initialize variables

my @file_data = ( );

my $dna = '';

my $protein = '';

# Read in the contents of the file "sample.dna"

@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"

$dna = extract_sequence_from_fasta_data(@file_data);

# Translate the DNA to protein

$protein = dna2peptide($dna);

# Print the sequence in lines 25 characters long

print_sequence($protein, 25);


Here's the output of Example 8-3:
















8.6 Reading Frames

The biologist knows that, given a sequence of DNA, it is necessary to examine all six reading frames of the DNA to find the coding regions the cell uses to make proteins.

8.6.1 What Are Reading Frames?

Very often you won't know where in the DNA you're studying the cell actually begins translating the DNA into protein. Only about 1-1.5% of human DNA is in genes, which are the parts of DNA used for the translation into proteins. Furthermore, genes very often occur in pieces that are spliced together during the transcription/translation process.

If you don't know where the translation starts, you have to consider the six possible reading frames. Since the codons are three bases long, the translation happens in three "frames," for instance starting at the first base, or the second, or perhaps the third. (The fourth would be the same as starting from the first.) Each starting place gives a different series of codons, and, as a result, a different series of amino acids.

Also, transcription and translation can happen on either strand of the DNA; that is, either the DNA sequence, or its reverse complement, might contain DNA code that is actually translated. The reverse complement can also be read in any one of three frames. So a total of six reading frames have to be considered when looking for coding regions , that part of the DNA that encodes proteins.

It is therefore quite common to examine all six reading frames of a DNA sequence and to look at the resulting protein translations for long stretches of amino acids that lack stop codons.

The stop codons are definite breaks in the DNAp

rotein translation process. During translation (actually of RNA to protein, but I'm being deliberately informal and vague about the biochemistry), if a stop codon is reached, the translation stops, and the growing peptide chain grows no more.

Long stretches of DNA that don't contain any stop codons are called open reading frames (ORFs) and are important clues to the presence of a gene in the DNA under study. So gene finder programs need to perform the type of reading frame analysis we'll do in this chapter.

8.6.2 Translating Reading Frames

Based on the facts just presented, let's write some code that translates the DNA in all six reading frames.

In the real world, you'd look around for some subroutines that are already written to do that task. Given the basic nature of the task—something anyone who studies DNA has to do—you'd likely find something. But this is a tutorial, not the real world, so let's soldier on.

This problem doesn't sound too daunting. So, take stock of the subroutines at your disposal, think of where you are and how you can get to your destination.

Looking through the subroutines we've already written, recall dna2peptide. You may recall considering adding some arguments to specify starting and end points. Let's do this now.

Remember that although we calculated reverse complements back in Chapter 4, we never made a subroutine out of it. So let's start there:

# revcom


# A subroutine to compute the reverse complement of DNA sequence

sub revcom {
my($dna) = @_;
# First reverse the sequence

my($revcom) = reverse($dna);

# Next, complement the sequence, dealing with upper and lower case

# A->T, T->A, C->G, G->C

$revcom =~ tr/ACGTacgt/TGCAtgca/;
return $revcom;


Now, a little pseudocode to sketch an idea for the subroutine that will translate specific ranges of DNA:

Given DNA sequence
subroutine translate_frame ( DNA, start, end)
return dna2peptide( substr( DNA, start, end - start + 1 ) )


That went well! Luckily, the substr built-in Perl function made it easy to apply the desired start and end points, while passing the DNA into the already written dna2peptide subroutine.

Note that the length of the sequence is end-start+1. To give a small example: if you start at position 3 and end at position 5, you've got the bases at positions 3, 4, and 5, three bases in all, which is exactly what 5 - 3 + 1 equals.

Dealing with indices like this has to be done carefully, or the code won't work. For many programs, this is the worst the mathematics gets.

Pay attention to the indices
You have to decide if you wish to keep the numbering of positions from 0, which is Perl's way to do it, or the first character of the sequence is in position 1, which is the biologist's way to do it. Let's do it the biologist's way. The positions will be decreased by one when passed to the Perl function substr, which, of course, does it Perl's way.

The corrected pseudocode looks like this:

Given DNA sequence
subroutine translate_frame ( DNA, start, end)
# start and end are numbering the sequence from 1 to length
return dna2peptide( substr( DNA, start - 1, end - start + 1 ) )


The length of the desired sequence doesn't change with the change in indices, since:

(end - 1) - (start - 1) + 1 = end - start + 1

So let's write this subroutine:

# translate_frame


# A subroutine to translate a frame of DNA

sub translate_frame {
my($seq, $start, $end) = @_;
my $protein;
# To make the subroutine easier to use, you won't need to specify

# the end point--it will just go to the end of the sequence

# by default.

unless($end) {

$end = length($seq);

# Finally, calculate and return the translation

return dna2peptide ( substr ( $seq, $start - 1, $end -$start + 1) );


Example 8-4 translates the DNA in all six reading frames.

Example 8-4. Translate a DNA sequence in all six reading frames


# Translate a DNA sequence in all six reading frames

use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module
# Initialize variables

my @file_data = ( );

my $dna = '';

my $revcom = '';

my $protein = '';
# Read in the contents of the file "sample.dna"

@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"

$dna = extract_sequence_from_fasta_data(@file_data);

# Translate the DNA to protein in six reading frames

# and print the protein in lines 70 characters long

print "\n -------Reading Frame 1--------\n\n";

$protein = translate_frame($dna, 1);

print_sequence($protein, 70);
print "\n -------Reading Frame 2--------\n\n";

$protein = translate_frame($dna, 2);

print_sequence($protein, 70);
print "\n -------Reading Frame 3--------\n\n";

$protein = translate_frame($dna, 3);

print_sequence($protein, 70);
# Calculate reverse complement

$revcom = revcom($dna);

print "\n -------Reading Frame 4--------\n\n";

$protein = translate_frame($revcom, 1);

print_sequence($protein, 70);
print "\n -------Reading Frame 5--------\n\n";

$protein = translate_frame($revcom, 2);

print_sequence($protein, 70);

print "\n -------Reading Frame 6--------\n\n";

$protein = translate_frame($revcom, 3);

print_sequence($protein, 70);


Here's the output of Example 8-4:

-------Reading Frame 1--------






-------Reading Frame 2--------






-------Reading Frame 3--------






-------Reading Frame 4--------







-------Reading Frame 5--------






-------Reading Frame 6--------






1   ...   13   14   15   16   17   18   19   20   ...   28

The database is protected by copyright © 2017
send message

    Main page