Sequence analysis part II
This exercise is part of session to illustrate the use of sequence comparison
in bioinformatics.
Exercises
part I shows sequence analysis as carried out the molecular
biologist. This exercise is a more algorithmic in character and it is different
depending on if you have any previous programming experience or not.
Without programming experience
1. Download this
C-program and place it in a file that has the name alignment.c. You
can check what files you have in your directory by typing the command 'ls'.
By typing 'cat alignment.c' or 'more alignment.c'
you can look at the source code. You might also find it convenient to use
an editor (emacs, vi, etc.):
e.g. type 'emacs alignment.c' or 'emacs alignment.c &'.
Compile the program using the command
gcc -o align alignment.c
This results in a binary executable (machine code file) with the
name align. This can be run by typing 'align'. Enter some short DNA-string
and see what happens. What does the program do? The program can be stopped
by CTRL-C.
Try out the program by entering different strings, and provoking
gaps in both sequences.
2. Carefully examine the source code of the program and try to understand
what is does in every step and how it works. This will not be so easy,
but it is a very good exercise to try to do this. First just try to get
a rough idea of what is going on in the different parts of the program,
then successively try to understand more and more details. If something
seems very difficult, skip it and come back to it later. Some of this work
may have to be done also after the supervised session, but ask as much
as possible while you can. Some explanations of C code syntax that might
help:
-
#define N 7 Defines the constant N to be the value 7.
-
/* a comment is written like this */
-
{} are used to group several instructions.
-
int i; Declares that the program should reserve one memory cell for an
integer variable and call it i.
-
int a[100]; Declares that the program should reserve 100 consecutive memory
cells for integer variables which can then be referred to as a[0] to a[99]
(NOT a[1] to a[100]!).
-
int b[10][10]; Reserves an integer matrix. The rows are stored after each
other in memory.
-
char a; Declares a character variable.
-
printf and scanf are predefined input/output functions. Don't worry about
the details.
-
i=k+2; Computes the value of k+2 and puts it in the memory location called
i. This is called an assignment operation.
-
i++; Increase i by 1.
-
if(i==4) some instruction; If i is 4 do the instruction. Note that == is
a logical test and not an assignment. != means "not equal to".
-
while(i==4) some instruction; Repeat the instruction until the condition
i==4 is no longer satisfied.
-
for (i=0; i<=100; i++) some instruction; First set i to 0 and do the
instruction. Then set i to 1 and do the instruction again. Repeat for all
i up to and including 99.
3. Write a short description of the different parts of the code and what
they do. Also explain some tricky details that you managed to understand.
With programming experience
1. Download, compile and run this
C-program. What does the program do? Try out the program by entering
different strings, and provoking gaps in both sequences. Check the source
code so that you get the idea of how it works.
2. In the lectures we have seen how we can solve the alignment
problem for an arbitrary matrix with pairwise matching scores, and a gap
cost. Unfortunately, the assumption that the gap cost is proportional to
the length of the gap is not biologically accurate, since if an insertion
or deletion takes place it could almost as well be a longer piece that
is inserted or deleted. A more accurate way to model this could be to have
a high fixed cost d for the first position of a gap, and then a lower cost
e for remaining positions. The question is now how the algorithms could
be generalized to handle this case?
Try different value combinations for the generalized version of the
program. What do you observe?
Optional
3. Sometimes there exist several alignments with equal
score. Suggest a modification of the algorithm that computes all
optimal alignments.
4. Since the algorithm and the scoring model are build on several
assumptions that might not reflect reality in every detail, it can be
interesting to compute other alignments which have an almost optimal
score. Suggest a modification of the algorithm that either computes
the 10 best alignments or that computes all alignments which are
better than the optimal score minus 5.
grohe@cs.chalmers.se
Last modified: Fre Sept 20 03:01:33 MET 2002
by Birgit Grohe. This document is based on a document by Dag Wedelin.