Bioinformatik 1 exercise
week 43 (21/10)
Pairwise alignment
The programs in this exercise use the Needleman-Wunsch algorithm (needle) to
calculate the global alignment and the Smith-Waterman algorithm (water) to
calculate the local alignment.
These two programs are included in the EMBOSS package,
which can be run either locally or online through the biological
software package at "Institut Pasteur". There, the
programs can be found under "Sequences Alignment and Comparisons". The
programs can also be run at the "European Bioinformatics Institute"
(EMBL-EBI).
To allow comparison between the results of the dot plot exercise
and global or local pairwise alignments, the "epidermal growth
factor receptor" EGFR and the tyrosine kinase-type cell surface receptor
HER2 are reused.
The Needleman-Wunsch and Smith-Waterman algorithms belong to the class
of dynamic programming algorithms that calculate the best-scoring alignment,
global or local, respectively, in the order of mn steps, where m and n are
the lengths of the two sequences.
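The mn step count can be seen in a minimal sketch of the two recurrences. This is an illustration only, not the EMBOSS implementation: it assumes a simple match/mismatch score and a linear gap cost (the values 1, -1 and -2 are arbitrary choices), whereas needle and water use a substitution matrix and separate gap open and gap extension penalties.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score (Needleman-Wunsch, linear gap cost)."""
    m, n = len(a), len(b)
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        # Column 0: aligning a[:i] with an empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                          prev[j] + gap,     # gap in b
                          curr[j - 1] + gap) # gap in a
        prev = curr
    return prev[n]


def sw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap cost)."""
    m, n = len(a), len(b)
    best = 0
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment start anywhere.
            curr[j] = max(0,
                          prev[j - 1] + s,
                          prev[j] + gap,
                          curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Both functions fill an (m+1) by (n+1) table one row at a time, which is where the mn steps come from. The only differences are the initialisation of the first row and column, the floor at 0, and where the final score is read (the last cell for the global score, the maximum over all cells for the local score).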
An important problem is the treatment of gaps, i.e., spaces inserted to
optimize the alignment score. A penalty is subtracted from the score for
each gap opened (the 'gap open' penalty), and a further penalty is subtracted
for the total number of gap spaces multiplied by a cost (the 'gap extension'
penalty). Typically, the cost of extending a gap is set to be 5-10 times
lower than the cost of opening a gap. By default, the programs use the BLOSUM62
matrix for protein sequences and the DNAFULL matrix for nucleotide
sequences.
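The arithmetic of this gap scheme can be sketched as follows. The function name is hypothetical, and the formula follows the description above (the open cost once, plus the extension cost for every gap space). Some implementations charge the extension cost only for the spaces after the first, so check the program documentation before relying on exact numbers.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for a single gap spanning `length` spaces.

    Assumption: open cost once, plus extension cost per gap space,
    as described in the text. Not taken from the EMBOSS source code.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# One gap of four spaces versus four separate gaps of one space each.
one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
```

With these parameters the single long gap is penalised 12.0 and the four separate gaps 42.0 in total, which is why a high open cost and a low extension cost favour alignments with few, longer gaps.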
A general introduction to database searching can be found here.
Links to documentation of the programs: needle, water
Exercise 1.
Align HER2 _ERB2_HUMAN to EGFR_DROME and UNKNOWN_AAL39899.1 with needle and water. Copy
and paste the sequences into the program at one of the web locations above
(EBI is probably the easiest).
What is the main difference between the two types of alignment in these
two cases?
Now save the three sequences in files and try to align them with the locally
installed programs. The whole EMBOSS package is installed in /net/emboss/bin/.
You can either add that directory to your path [set path = ( $path /net/emboss/bin/
); rehash] or use the full path of the program. Type:
needle HER2-fasta.prt ALL39899_1.prt
if you have set your path, or
/net/emboss/bin/needle HER2-fasta.prt ALL39899_1.prt
otherwise.
The output is written to a file.
Repeat the Smith-Waterman alignment of HER2-fasta.prt and ALL39899_1.prt
with different parameters.
What happens if gap penalties are changed to 30 and 2 instead of the defaults
10 and 0.5?
BLOSUM62 is the default matrix. What happens to the alignment when you use other matrices,
e.g. PAM10?
water -datafile EPAM10 HER2-fasta.prt ALL39899_1.prt
What is the difference between PAM10 and BLOSUM62? You can see the matrices
in /net/emboss/share/EMBOSS/data/.
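One way to compare two matrices is to load them and look at simple summaries such as the average score for identical and for non-identical residue pairs. The sketch below assumes the usual whitespace-separated layout (comment lines starting with #, one header row of residue letters, then one row per residue). The function names and the two-residue matrix are hypothetical, and the toy values are made up for illustration.

```python
def parse_matrix(text):
    """Parse a whitespace-separated scoring matrix into {(row, col): score}."""
    scores = {}
    columns = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if columns is None:
            columns = fields  # first data line is the column header
            continue
        row, values = fields[0], fields[1:]
        for col, value in zip(columns, values):
            scores[(row, col)] = float(value)
    return scores


def summarize(scores):
    """Return (mean score of identical pairs, mean score of different pairs)."""
    diag = [v for (r, c), v in scores.items() if r == c]
    off = [v for (r, c), v in scores.items() if r != c]
    return sum(diag) / len(diag), sum(off) / len(off)


# Toy matrix with made-up values, NOT taken from BLOSUM62 or PAM10.
TOY = """\
# toy example
   A  R
A  4 -1
R -1  5
"""
toy_scores = parse_matrix(TOY)
toy_summary = summarize(toy_scores)
```

To apply this to the real matrices, read the text of the relevant files in the data directory and pass it to parse_matrix, then compare the two summaries.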
Exercise 2.
Make a local protein alignment of the HER2 protein Spongilla_tyrosine_kinase.prt against
all the proteins encoded in the E. coli
genome. A database with the proteins has already been downloaded,
so you can type:
/net/emboss/bin/water HER2-fasta.prt /net/data/Ecoli.prot.fasta
You can get all the scores from the file with this perl command:
perl -ne 'print "$1\n" if (/^# Score:\s*([0-9]+)/);' <output file>
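The same extraction can be sketched in Python. The function name and the sample lines are hypothetical; the pattern matches the "# Score:" header lines that the Perl command looks for. Unlike the Perl one-liner, which keeps only the digits before any decimal point, this version keeps the full number.

```python
import re

# Lines of the form "# Score: 123.5" in the alignment output.
SCORE_LINE = re.compile(r"^# Score:\s*([0-9]+(?:\.[0-9]+)?)")


def extract_scores(lines):
    """Return the alignment scores found in an iterable of text lines."""
    scores = []
    for line in lines:
        found = SCORE_LINE.match(line)
        if found:
            scores.append(float(found.group(1)))
    return scores


# Made-up sample lines for illustration only.
sample = [
    "# Length: 120",
    "# Score: 52.5",
    "# Gaps: 3",
    "# Score: 17.0",
]
sample_scores = extract_scores(sample)
```

For a real run, open the output file and pass the file object to extract_scores, since iterating over an open file yields its lines.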
You can pipe the result into sort, for instance (" | sort -nr"), to find the largest
scores. Make a histogram of the scores in R.
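If R is not at hand, the sorting and binning can also be sketched in Python. The function names, the bin width of 10 and the example scores are all assumptions made for illustration. The exercise itself asks for a histogram in R.

```python
from collections import Counter


def top_scores(scores, k=5):
    """Return the k largest scores, highest first (like sort -nr | head)."""
    return sorted(scores, reverse=True)[:k]


def histogram(scores, bin_width=10):
    """Count scores per bin; keys are the lower edges of the bins."""
    counts = Counter(int(s // bin_width) * bin_width for s in scores)
    return dict(sorted(counts.items()))


def print_histogram(hist, bin_width=10):
    """Print one text bar per bin."""
    for start, count in hist.items():
        print(f"{start:>5}-{start + bin_width:<5} {'#' * count}")


# Made-up scores for illustration only.
example_scores = [12, 17.5, 23, 45, 48, 49.5]
example_hist = histogram(example_scores)
example_top = top_scores(example_scores, k=2)
```

Most alignments against unrelated proteins are expected to fall in the low bins, so the interesting matches are the few scores that stand out to the right of the bulk of the distribution.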
Check out the highest-scoring matches. You can read about the matching
proteins in the Swiss-Prot database.
Try to interpret the matches.
Repeat all this with the protein Yeast_DIE2.prt.