Bioinformatik 1 exercise

week 43 (21/10)

Pairwise alignment


The programs in this exercise are needle, which uses the Needleman-Wunsch algorithm to calculate a global alignment, and water, which uses the Smith-Waterman algorithm to calculate a local alignment.

Both programs are part of the EMBOSS package and can be run either locally or online through the biological software package at "Institut Pasteur", where they are found under "Sequences Alignment and Comparisons". They can also be run at the "European Bioinformatics Institute" (EMBL-EBI).

To allow comparison between the results from the dot plot exercise and the global or local pairwise alignments, the "epidermal growth factor receptor" (EGFR) and the tyrosine kinase-type cell surface receptor HER2 are reused.

The Needleman-Wunsch and Smith-Waterman algorithms belong to the class of algorithms that calculate the best-scoring alignment, global or local respectively, in on the order of mn steps, where n and m are the lengths of the two sequences.
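To make the "order of mn steps" concrete, here is a minimal Python sketch of the score calculation. It is not the code used by needle or water: it assumes a simple match/mismatch score and a single linear gap cost instead of a substitution matrix with separate open and extension penalties, and the function name and default values are chosen only for illustration. The two nested loops fill one cell for every pair of positions, which gives the mn steps.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b with a linear gap cost (illustrative only)."""
    n, m = len(a), len(b)
    # Row 0: aligning a prefix of b against nothing costs j gaps (global),
    # or nothing at all (local, where an alignment may start anywhere).
    prev = [0 if local else j * gap for j in range(m + 1)]
    best = 0
    for i in range(1, n + 1):
        curr = [0 if local else i * gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(diag, up, left)
            if local:
                # Smith-Waterman: never let a cell drop below zero.
                score = max(score, 0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    # Global: score in the last cell. Local: best cell anywhere in the table.
    return best if local else prev[m]
```

With local=False the function follows the Needleman-Wunsch recurrence, with local=True the Smith-Waterman one. Only two rows are kept in memory because the sketch returns the score and not the alignment itself.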

An important problem is the treatment of gaps, i.e., spaces inserted to optimize the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty), and a further penalty is subtracted that equals the total number of gap positions multiplied by a cost (the 'gap extension' penalty). Typically, the cost of extending a gap is set 5-10 times lower than the cost of opening one. By default, the programs use the BLOSUM62 matrix for protein sequences and the DNAFULL matrix for nucleotide sequences.
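The gap penalty described above can be worked out by hand or with a few lines of Python. The sketch below follows the description in the text literally (one opening cost per gap, plus the extension cost for every gap position). This is an assumption: some implementations charge the extension cost only from the second position of a gap, so the exact numbers reported by a given program may differ slightly. The function name is hypothetical.

```python
def gap_penalty(gap_lengths, gap_open=10.0, gap_extend=0.5):
    """Total penalty for a list of gap lengths, as described in the text above.

    Assumption: every gap position, including the first, pays the extension cost.
    """
    n_gaps = len(gap_lengths)
    n_positions = sum(gap_lengths)
    return n_gaps * gap_open + n_positions * gap_extend


# Four gap positions in total, either as one gap or split over two gaps.
one_long_gap = gap_penalty([4])
two_short_gaps = gap_penalty([3, 1])
```

Because opening is much more expensive than extending, one long gap is penalized less than the same number of gap positions spread over several gaps.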

A general introduction to database searching can be found here.

Links to documentation of the programs: needle, water


Exercise 1.

Align HER2 _ERB2_HUMAN to EGFR_DROME and UNKNOWN_AAL39899.1 with needle and water. Copy and paste the sequences into the program at one of the web locations above (EBI is probably easiest).

What is the main difference between the two types of alignment in these two cases?

Now save the 3 sequences in files and try to align them with the locally installed programs. The whole EMBOSS package is installed in /net/emboss/bin/. You can either add that directory to your path [set path = ( $path /net/emboss/bin/ ); rehash] or use the full path of the program:
needle HER2-fasta.prt ALL39899_1.prt
if you have set your path, or
/net/emboss/bin/needle HER2-fasta.prt ALL39899_1.prt
otherwise.

The output is written to a file.

Repeat the Smith-Waterman alignment of HER2-fasta.prt and ALL39899_1.prt with different parameters.

What happens if gap penalties are changed to 30 and 2 instead of the defaults 10 and 0.5?

BLOSUM62 is the default matrix. What happens to the alignment when using other matrices, e.g. PAM10?
water -datafile EPAM10 HER2-fasta.prt ALL39899_1.prt
What is the difference between PAM10 and BLOSUM62? You can see the matrices in /net/emboss/share/EMBOSS/data/.
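As an aid for comparing two matrices, the following Python sketch reads a matrix and summarizes its diagonal (identical residues) and off-diagonal (substitutions) scores. It assumes the matrix files are plain text with comment lines starting with "#", one header row of residue letters, and then one row per residue starting with its letter. Check this against the actual files before relying on it. The function names are hypothetical.

```python
def parse_matrix(lines):
    """Parse a whitespace-separated substitution matrix into {(row, col): score}.

    Assumed layout: '#' comment lines, a header row of residue letters,
    then one row per residue beginning with that residue's letter.
    """
    scores = {}
    columns = None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if columns is None:
            columns = fields  # header row with the column residues
            continue
        row, values = fields[0], fields[1:]
        for col, value in zip(columns, values):
            scores[(row, col)] = float(value)
    return scores


def diagonal_summary(scores):
    """Return (mean score for identical residues, mean score for substitutions)."""
    same = [v for (r, c), v in scores.items() if r == c]
    diff = [v for (r, c), v in scores.items() if r != c]
    return sum(same) / len(same), sum(diff) / len(diff)
```

A file can be passed directly, for example with open(path) as handle: scores = parse_matrix(handle), where path points to one of the matrix files in the data directory mentioned above.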


Exercise 2.

Make a local protein alignment of the HER2 protein Spongilla_tyrosine_kinase.prt against all the proteins encoded in the E. coli genome. A database with the proteins has already been downloaded, so you can type:
/net/emboss/bin/water HER2-fasta.prt /net/data/Ecoli.prot.fasta
You can extract all the scores from the output file with this Perl command:
perl -ne 'print "$1\n" if (/^# Score:\s*([0-9]+)/);' <output file>
You can pipe this into sort, for instance (" | sort -nr"), to find the largest scores. Make a histogram of the scores in R.
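For those who prefer Python to a Perl one-liner, the same extraction can be sketched as below. It assumes, like the Perl command, that each alignment in the output has a line beginning with "# Score:". Unlike the Perl pattern it also keeps any decimal part of the score. The function names and the bin width are chosen only for illustration.

```python
import re
from collections import Counter

# Same idea as the Perl pattern, but also capturing an optional decimal part.
SCORE_LINE = re.compile(r"^# Score:\s*([0-9]+(?:\.[0-9]+)?)")


def extract_scores(lines):
    """Return all alignment scores found in the lines, largest first."""
    scores = []
    for line in lines:
        found = SCORE_LINE.match(line)
        if found:
            scores.append(float(found.group(1)))
    return sorted(scores, reverse=True)


def histogram(scores, width=10):
    """Count scores per bin; keys are the lower edges of the bins."""
    return Counter(int(score // width) * width for score in scores)
```

An open file can be passed as lines, for example with open(filename) as handle: scores = extract_scores(handle), where filename is the name of your water output file.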

Check out the highest scoring matches. You can read about the matching proteins in the Swiss-Prot database. Try to interpret the matches.

Repeat all this with the protein Yeast_DIE2.prt.