Beginning Perl for Bioinformatics


Example 4-4. Calculating the reverse complement of a strand of DNA



Download 1.4 Mb.
Page7/28
Date conversion29.03.2017
Size1.4 Mb.
1   2   3   4   5   6   7   8   9   10   ...   28

Example 4-4. Calculating the reverse complement of a strand of DNA

#!/usr/bin/perl -w

# Calculating the reverse complement of a strand of DNA
# The DNA

$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';


# Print the DNA onto the screen

print "Here is the starting DNA:\n\n";


print "$DNA\n\n";
# Calculate the reverse complement

# Warning: this attempt will fail!

#

# First, copy the DNA into new variable $revcom



# (short for REVerse COMplement)

# Notice that variable names can use lowercase letters like

# "revcom" as well as uppercase like "DNA". In fact,

# lowercase is more common.

#

# It doesn't matter if we first reverse the string and then



# do the complementation; or if we first do the complementation

# and then reverse the string. Same result each time.

# So when we make the copy we'll do the reverse in the same statement.

#
$revcom = reverse $DNA;


#

# Next substitute all bases by their complements,

# A->T, T->A, G->C, C->G

#
$revcom =~ s/A/T/g;

$revcom =~ s/T/A/g;

$revcom =~ s/G/C/g;

$revcom =~ s/C/G/g;
# Print the reverse complement DNA onto the screen

print "Here is the reverse complement DNA:\n\n";


print "$revcom\n";
#

# Oh-oh, that didn't work right!

# Our reverse complement should have all the bases in it, since the

# original DNA had all the bases--but ours only has A and G!

#

# Do you see why?


#

# The problem is that the first two substitute commands above change

# all the A's to T's (so there are no A's) and then all the

# T's to A's (so all the original A's and T's are all now A's).

# Same thing happens to the G's and C's all turning into G's.

#

print "\nThat was a bad algorithm, and the reverse complement was wrong!\n";

print "Try again ... \n\n";
# Make a new copy of the DNA (see why we saved the original?)

$revcom = reverse $DNA;


# See the text for a discussion of tr///

$revcom =~ tr/ACGTacgt/TGCAtgca/;


# Print the reverse complement DNA onto the screen

print "Here is the reverse complement DNA:\n\n";


print "$revcom\n";
print "\nThis time it worked!\n\n";
exit;

Here's what the output of Example 4-4 should look like on your screen:

Here is the starting DNA:
ACGGGAGGACGGGAAAATTACTACGGCATTAGC
Here is the reverse complement DNA:
GGAAAAGGGGAAGAAAAAAAGGGGAGGAGGGGA
That was a bad algorithm, and the reverse complement was wrong!

Try again ...


Here is the reverse complement DNA:
GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT

This time it worked!

You can check if two strands of DNA are reverse complements of each other by reading one left to right, and the other right to left, that is, by starting at different ends. Then compare each pair of bases as you read the two strands: they should always be paired C to G and A to T.

Just by reading in a few characters from the starting DNA and the reverse complement DNA from the first attempt, you'll see the that first attempt at calculating the reverse complement failed. It was a bad algorithm.

This is a taste of what you'll sometimes experience as you program. You'll write a program to accomplish a job and then find it didn't work as you expected. In this case, we used parts of the language we already knew and tried to stretch them to handle a new problem. Only they weren't quite up to the job. What went wrong?

You'll find that this kind of experience becomes familiar: you write some code, and it doesn't work! So you either fix the syntax (that's usually the easy part and can be done from the clues the error messages provide), or you think about the problem some more, find why the program failed, and then try to devise a new and successful way. Often this requires browsing the language documentation, looking for the details of how the language works and hoping to find a feature that fixes the problem. If it can be solved on a computer, you can solve it using Perl. The trick is, how exactly?

In Example 4-4, the first attempt to calculate the reverse complement failed. Each base in the string was translated as a whole, using four substitutions in a global fashion. Another way is needed. You could march though the DNA left to right, look at each base one at a time, make the change to the complement, and then look at the next base in the DNA, marching on to the end of the string. Then just reverse the string, and you're done. In fact, this is a perfectly good method, and it's not hard to do in Perl, although it requires some parts of the language not found until Chapter 5.

However, in this case, the tr operator—which stands for transliterate or translation—is exactly suited for this task. It looks like the substitute command, with the three forward slashes separating the different parts.


tr does exactly what's needed; it translates a set of characters into new characters, all at once. Figure 4-2 shows how it works: the set of characters to be translated are between the first two forward slashes. The set of characters that replaces the originals are between the second and third forward slashes. Each character in the first set is translated into the character at the same position in the second set. For instance, in Example 4-4, C is the second character in the first set, so it's translated into the second character of the second set, namely, G. Finally, since DNA sequence data can use upper- or lowercase letters (even though in this program the DNA is in uppercase only), both cases are included in the tr statement in Example 4-4.

Figure 4-2. The tr statement

The reverse function also does exactly what's needed, with a minimum of fuss. It's designed to reverse the order of elements, including strings as seen in Example 4-4.



4.7 Proteins, Files, and Arrays

So far we've been writing programs with DNA sequence data. Now we'll also include the equally important protein sequence data. Here's an overview of what is covered in the following sections:



  • How to use protein sequence data in a Perl program

  • How to read protein sequence data in from a file

  • Arrays in the Perl language

For the rest of the chapter, both protein and DNA sequence data are used.

4.8 Reading Proteins in Files

Programs interact with files on a computer disk. These files can be on hard disk, CD, floppy disk, Zip drive, magnetic tape—any kind of permanent storage.

Let's take a look at how to read protein sequence data from a file. First, create a file on your computer (use your text editor) and put some protein sequence data into it. Call the file NM_021964fragment.pep (you can download it from this book's web site). You will be using the following data (part of the human zinc finger protein NM_021964):

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD

SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ

GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

You can use any name, except one that's already in use in the same folder.

Just as well-chosen variable names can be critical to understanding a program, well-chosen file and folder names can also be critical. If you have a project that generates lots of computer files, you need to carefully consider how to name and organize the files and folders. This is as true for individual researchers as for large, multi-national teams. It's important to put some effort into assigning informative names to files.

The filename NM_021964fragment.pep is taken from the GenBank ID of the record where this protein is found. It also indicates the fragmentary nature of the data and contains the filename extension .pep to remind you that the file contains peptide or protein sequence data. Of course, some other scheme might work better for you; the point is to get some idea of what's in the file without having to look into it.

Now that you've created or downloaded a file with protein sequence data in it, let's develop a program that reads the protein sequence data from the file and stores it into a variable. Example 4-5 shows a first attempt, which will be added to as we progress.


Example 4-5. Reading protein sequence data from a file

#!/usr/bin/perl -w

# Reading protein sequence data from a file
# The filename of the file containing the protein sequence data

$proteinfilename = 'NM_021964fragment.pep';


# First we have to "open" the file, and associate

# a "filehandle" with it. We choose the filehandle

# PROTEINFILE for readability.

open(PROTEINFILE, $proteinfilename);


# Now we do the actual reading of the protein sequence data from the file,

# by using the angle brackets < and > to get the input from the

# filehandle. We store the data into our variable $protein.

$protein =


;
# Now that we've got our data, we can close the file.

close PROTEINFILE;


# Print the protein onto the screen

print "Here is the protein:\n\n";


print $protein;
exit;

Here's the output of Example 4-5:

Here is the protein:

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD

Notice that only the first line of the file prints out. I'll show why in a moment.

Let's look at Example 4-5 in more detail. After putting a filename into the variable $proteinfilename, the file is opened with the following statement:

open(PROTEINFILE, $proteinfilename);

After opening the file, you can do various things with it, such as reading, writing, searching, going to a specific location in the file, erasing everything in the file, and so on. Notice that the program assumes the file named in the variable $proteinfilename exists and can be opened. You'll see in a little bit how to check for that, but here's something to try: change the name of the filename in $proteinfilename so that there's no file of that name on your computer, and then run the program. You'll get some error messages if the file doesn't exist.

If you look at the documentation for the open function, you'll see many options. Mostly, they enable you to specify exactly what the file will be used for after it's opened.

Let's examine the term PROTEINFILE, which is called a filehandle. With filehandles, it's not important to understand what they really are. They're just things you use when you're dealing with files. They don't have to have capital letters, but it's a widely followed convention. After the open statement assigns a filehandle, all the interaction with a file is done by naming the filehandle.

The data is actually read in to the program with the statement:

$protein =

;

Why is the filehandle PROTEINFILE enclosed within angle brackets? These angle brackets are called input operators; a filehandle within angle brackets is how you bring in data from some source outside the program. Here, we're reading the file called NM_021964fragment.pep whose name is stored in variable $proteinfilename, and which has a filehandle associated with it by the open statement. The data is being stored in the variable $protein and then printed out.

However, as you've already noticed, only the first line of this multiline file is printed out. Why? Because there are a few more things to learn about reading in files.

There are several ways to read in a whole file. Example 4-6 shows one way.



Example 4-6. Reading protein sequence data from a file, take 2

#!/usr/bin/perl -w

# Reading protein sequence data from a file, take 2
# The filename of the file containing the protein sequence data

$proteinfilename = 'NM_021964fragment.pep';


# First we have to "open" the file, and associate

# a "filehandle" with it. We choose the filehandle

# PROTEINFILE for readability.

open(PROTEINFILE, $proteinfilename);


# Now we do the actual reading of the protein sequence data from the file,

# by using the angle brackets < and > to get the input from the

# filehandle. We store the data into our variable $protein.

#

# Since the file has three lines, and since the read only is



# returning one line, we'll read a line and print it, three times.
# First line

$protein =


;

# Print the protein onto the screen

print "\nHere is the first line of the protein file:\n\n";

print $protein;
# Second line

$protein =


;
# Print the protein onto the screen

print "\nHere is the second line of the protein file:\n\n";


print $protein;
# Third line

$protein =


;
# Print the protein onto the screen

print "\nHere is the third line of the protein file:\n\n";


print $protein;
# Now that we've got our data, we can close the file.

close PROTEINFILE;


exit;

Here's the output of Example 4-6:

Here is the first line of the protein file:
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
Here is the second line of the protein file:
SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
Here is the third line of the protein file:
GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

The interesting thing about this program is that it shows how reading from a file works. Every time you read into a scalar variable such as $protein, the next line of the file is read. Something is remembering where the previous read was and is picking it up from there.

On the other hand, the drawbacks of this program are obvious. Having to write a few lines of code for each line of an input file isn't convenient. However, there are two Perl features that can handle this nicely: arrays (in the next section) and loops (in Chapter 5).

4.9 Arrays

In computer languages an array is a variable that stores multiple scalar values. The values can be numbers, strings, or, in this case, lines of an input file of protein sequence data. Let's examine how they can be used. Example 4-7 shows how to use an array to read all the lines of an input file.


Example 4-7. Reading protein sequence data from a file, take 3

#!/usr/bin/perl -w

# Reading protein sequence data from a file, take 3

# The filename of the file containing the protein sequence data

$proteinfilename = 'NM_021964fragment.pep';


# First we have to "open" the file

open(PROTEINFILE, $proteinfilename);


# Read the protein sequence data from the file, and store it

# into the array variable @protein

@protein =
;
# Print the protein onto the screen

print @protein;


# Close the file.

close PROTEINFILE;


exit;

Here's the output of Example 4-7:

MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD

SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ

GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR

which, as you can see, is exactly the data that's in the file. Success!

The convenience of this is clear—just one line to read all the data into the program.

Notice that the array variable starts with an at sign (@) rather than the dollar sign ($) scalar variables begin with. Also notice that the print function can handle arrays as well as scalar variables. Arrays are used a lot in Perl, so you will see plenty of array examples as the book continues.

An array is a variable that can hold many scalar values. Each item or element is a scalar value that can be referenced by giving its position in the array (its subscript or offset). Let's look at some examples of arrays and their most common operations. We'll define an array @bases that holds the four bases A, C, G, and T. Then we'll apply some of the most common array operators.

Here's a piece of code that demonstrates how to initialize an array and how to use subscripts to access the individual elements of an array:

# Here's one way to declare an array, initialized with a list of four scalar values.

@bases = ('A', 'C', 'G', 'T');

# Now we'll print each element of the array

print "Here are the array elements:";

print "\nFirst element: ";

print $bases[0];

print "\nSecond element: ";

print $bases[1];

print "\nThird element: ";

print $bases[2];

print "\nFourth element: ";

print $bases[3];

This code snippet prints out:

First element: A

Second element: C

Third element: G

Fourth element: T

You can print the elements one a after another like this:

@bases = ('A', 'C', 'G', 'T');

print "\n\nHere are the array elements: ";

print @bases;

which produces the output:

Here are the array elements: ACGT

You can also print the elements separated by spaces (notice the double quotes in the print statement):

@bases = ('A', 'C', 'G', 'T');

print "\n\nHere are the array elements: ";

print "@bases";

which produces the output:

Here are the array elements: A C G T

You can take an element off the end of an array with pop:

@bases = ('A', 'C', 'G', 'T');

$base1 = pop @bases;

print "Here's the element removed from the end: ";

print $base1, "\n\n";

print "Here's the remaining array of bases: ";

print "@bases";

which produces the output:

Here's the element removed from the end: T

Here's the remaining array of bases: A C G

You can take a base off of the beginning of the array with shift:

@bases = ('A', 'C', 'G', 'T');

$base2 = shift @bases;

print "Here's an element removed from the beginning: ";

print $base2, "\n\n";

print "Here's the remaining array of bases: ";

print "@bases";

which produces the output:

Here's an element removed from the beginning: A

Here's the remaining array of bases: C G T

You can put an element at the beginning of the array with unshift:

@bases = ('A', 'C', 'G', 'T');

$base1 = pop @bases;

unshift (@bases, $base1);

print "Here's the element from the end put on the beginning: ";

print "@bases\n\n";

which produces the output:

Here's the element from the end put on the beginning: T A C G

You can put an element on the end of the array with push:

@bases = ('A', 'C', 'G', 'T');

$base2 = shift @bases;

push (@bases, $base2);

print "Here's the element from the beginning put on the end: ";

print "@bases\n\n";

which produces the output:

Here's the element from the beginning put on the end: C G T A

You can reverse the array:

@bases = ('A', 'C', 'G', 'T');

@reverse = reverse @bases;

print "Here's the array in reverse: ";

print "@reverse\n\n";

which produces the output:

Here's the array in reverse: T G C A

You can get the length of an array:

@bases = ('A', 'C', 'G', 'T');

print "Here's the length of the array: ";

print scalar @bases, "\n";

which produces the output:

Here's the length of the array: 4

Here's how to insert an element at an arbitrary place in an array using the Perl splice function:

@bases = ('A', 'C', 'G', 'T');

splice ( @bases, 2, 0, 'X');

print "Here's the array with an element inserted after the 2nd element: ";

print "@bases\n";

which produces the output:

Here's the array with an element inserted after the 2nd element: A C X G T


4.10 Scalar and List Context

Many Perl operations behave differently depending on the context in which they are used. Perl has scalar context and list context; both are listed in Example 4-8.



Example 4-8. Scalar context and list context

#!/usr/bin/perl -w

# Demonstration of "scalar context" and "list context"
@bases = ('A', 'C', 'G', 'T');
print "@bases\n";
$a = @bases;
print $a, "\n";
($a) = @bases;
print $a, "\n";
exit;

Here's the output of Example 4-8:

A C G T

4

A


First, Example 4-8 declares an array of the four bases. Then the assignment statement tries to assign an array (which is a kind of list) to a scalar variable $a:

$a = @bases;

In this kind of scalar context , an array evaluates to the size of the array, that is, the number of elements in the array. The scalar context is supplied by the scalar variable on the left side of the statement.

Next, Example 4-8 tries to assign an array (to repeat, a kind of list) to another list, in this case, having just one variable, $a:

($a) = @bases;

In this kind of list context , an array evaluates to a list of its elements. The list context is supplied by the list in parentheses on the left side of the statement. If there aren't enough variables on the left side to assign to, only part of the array gets assigned to variables. This behavior of Perl pops up in many situations; by design, many features of Perl behave differently depending on whether they are in scalar or list context. See Appendix B for more about scalar and list content.

Now you've seen the use of strings and arrays to hold sequence and file data, and learned the basic syntax of Perl, including variables, assignment, printing, and reading files. You've transcribed DNA to RNA and calculated the reverse complement of a strand of DNA. By the end of Chapter 5, you'll have covered the essentials of Perl programming.


4.11 Exercises
Exercise 4.1

Explore the sensitivity of programming languages to errors of syntax. Try removing the semicolon from the end of any statement of one of our working programs and examining the error messages that result, if any. Try changing other syntactical items: add a parenthesis or a curly brace; misspell "print" or some other reserved word; just type in, or delete, anything. Programmers get used to seeing such errors; even after getting to know the language well, it is still common to have some syntax errors as you gradually add code to a program. Notice how one error can lead to many lines of error reporting. Is Perl accurately reporting the line where the error is?



Exercise 4.2

Write a program that stores an integer in a variable and then prints it out.



Exercise 4.3

Write a program that prints DNA (which could be in upper- or lowercase originally) in lowercase (acgt); write another that prints the DNA in uppercase (ACGT). Use the function tr///.



Exercise 4.4

Do the same thing as Exercise 4.3, but use the string directives \U and \L for upper- and lowercase. For instance, print "\U$DNA" prints the data in $DNA in uppercase.



Exercise 4.5

Sometimes information flows from RNA to DNA. Write a program to reverse transcribe RNA to DNA.



Exercise 4.6

Read two files of data, and print the contents of the first followed by the contents of the second.


Exercise 4.7

This is a more difficult exercise. Write a program to read a file, and then print its lines in reverse order, the last line first. Or you may want to look up the functions push, pop, shift, and unshift, and choose one or more of them to accomplish this exercise. You may want to look ahead to Chapter 5 so you can use a loop in this program, but this may not be necessary depending on the approach you take. Or, you may want to use reverse on an array of lines.




1   2   3   4   5   6   7   8   9   10   ...   28


The database is protected by copyright ©hestories.info 2017
send message

    Main page