Sequence alignment exercise

Sequence analysis part II

This exercise is part of session to illustrate the use of sequence comparison in bioinformatics. 
 Exercises part I   shows sequence analysis as carried out the molecular biologist. This exercise is a more algorithmic in character and it is different depending on if you have any previous programming experience or not.

Without programming experience

1. Download this C-program and place it in a file that has the name alignment.c. You can check what files you have in your directory by typing the command 'ls'. By typing 'cat alignment.c' or 'more alignment.c' you can look at the source code. You might also find it convenient to use an editor (emacs, vi, etc.): e.g. type 'emacs alignment.c' or 'emacs alignment.c &'. Compile the program using the command

 gcc -o align alignment.c

 This results in a binary executable (machine code file) with the name align. This can be run by typing 'align'. Enter some short DNA-string and see what happens. What does the program do? The program can be stopped by CTRL-C.

 Try out the program by entering different strings, and provoking gaps in both sequences.

2. Carefully examine the source code of the program and try to understand what is does in every step and how it works. This will not be so easy, but it is a very good exercise to try to do this. First just try to get a rough idea of what is going on in the different parts of the program, then successively try to understand more and more details. If something seems very difficult, skip it and come back to it later. Some of this work may have to be done also after the supervised session, but ask as much as possible while you can. Some explanations of C code syntax that might help:

3. Write a short description of the different parts of the code and what they do. Also explain some tricky details that you managed to understand.

With programming experience

1. Download, compile and run this C-program. What does the program do? Try out the program by entering different strings, and provoking gaps in both sequences. Check the source code so that you get the idea of how it works.

 2. In the lectures we have seen how we can solve the alignment problem for an arbitrary matrix with pairwise matching scores, and a gap cost. Unfortunately, the assumption that the gap cost is proportional to the length of the gap is not biologically accurate, since if an insertion or deletion takes place it could almost as well be a longer piece that is inserted or deleted. A more accurate way to model this could be to have a high fixed cost d for the first position of a gap, and then a lower cost e for remaining positions. The question is now how the algorithms could be generalized to handle this case?  Try different value combinations for the generalized version of the program. What do you observe?


3. Sometimes there exist several alignments with equal score. Suggest a modification of the algorithm that computes all optimal alignments.

4. Since the algorithm and the scoring model are build on several assumptions that might not reflect reality in every detail, it can be interesting to compute other alignments which have an almost optimal score. Suggest a modification of the algorithm that either computes the 10 best alignments or that computes all alignments which are better than the optimal score minus 5.

Last modified: Fre Sept 20 03:01:33 MET 2002 by Birgit Grohe. This document is based on a document by Dag Wedelin.