Biological sequence analysis 1 exercise

Pairwise alignment

The programs in this exercise are needle, which uses the Needleman-Wunsch algorithm to calculate a global alignment, and water, which uses the Smith-Waterman algorithm to calculate a local alignment.
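
The difference between the two algorithms can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses a made-up match/mismatch score and a linear gap penalty, not the substitution matrix and opening/extension penalties that needle and water use, and the function name alignment_score is hypothetical.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b with a linear gap penalty.

    local=False: Needleman-Wunsch style (global, end to end).
    local=True:  Smith-Waterman style (best-scoring pair of substrings).
    The scoring values are illustrative assumptions, not EMBOSS defaults.
    """
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score in the last cell. Local: best score anywhere in the table.
    return best if local else prev[-1]


print(alignment_score("AAACGTAAA", "CGT"))              # global
print(alignment_score("AAACGTAAA", "CGT", local=True))  # local
```

With these toy settings the global score is pulled down by the gaps needed to span the longer sequence, while the local score only reflects the matching region.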

These two programs are included in the EMBOSS package, which can be run either locally or online through the biological software package at "Institut Pasteur". There the programs can be found under "Sequences Alignment and Comparisons". The programs can also be run at the European Bioinformatics Institute (EMBL-EBI).

To allow comparison between the alignment results from the dot plot exercise and the global or local pairwise alignments, the epidermal growth factor receptor (EGFR) and the tyrosine kinase-type cell surface receptor HER2 are reused.

Links to documentation of the programs: needle, water

It is a good idea to start by making a new directory to store everything.

Exercise 1

Align HER2 _ERB2_HUMAN to EGFR_DROME and to UNKNOWN_AAL39899.1 with needle and water. Copy and paste the sequences into the program at one of the web locations above (EBI is probably the easiest).

What is the main difference between the two types of alignment in these two cases?

Now save the three sequences in files and align them with the locally installed programs. The whole EMBOSS package is installed in /net/emboss/bin/. You can either add that directory to your path [set path = ( $path /net/emboss/bin/ ); rehash] or use the full path of the program:

needle HER2-fasta.prt ALL39899_1.prt
if you have set your path, or
/net/emboss/bin/needle HER2-fasta.prt ALL39899_1.prt
otherwise.

The output is written to a file (ls -t lists the most recently modified file first).

Repeat the Smith-Waterman alignment of HER2-fasta.prt and ALL39899_1.prt with different parameters.

What happens if the gap opening and gap extension penalties are changed to 30 and 2 instead of the defaults 10 and 0.5?
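
To get a feeling for the two settings, the cost of a single gap can be worked out by hand. The sketch below assumes the common affine convention in which the first gapped position costs the opening penalty and every further position costs the extension penalty; check the needle and water documentation for the exact convention EMBOSS applies. The function name gap_cost is hypothetical. On the command line the two penalties are set with the -gapopen and -gapextend qualifiers.

```python
def gap_cost(length, gap_open, gap_extend):
    """Penalty for one gap of the given length.

    Assumed convention: opening penalty for the first position,
    extension penalty for each additional position.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# Compare the default penalties (10, 0.5) with the stricter ones (30, 2).
for n in (1, 5, 20):
    print(n, gap_cost(n, 10, 0.5), gap_cost(n, 30, 2))
```

Under this assumption a gap has to be "paid for" by a much higher-scoring stretch of aligned residues when the penalties are 30 and 2, which is worth keeping in mind when comparing the two alignments.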

BLOSUM62 is the default substitution matrix. What happens to the alignment when other matrices are used, e.g. PAM10?

water -datafile EPAM10 HER2-fasta.prt ALL39899_1.prt
What is the difference between PAM10 and BLOSUM62? You can see the matrices in /net/emboss/share/EMBOSS/data/.
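
Besides reading the matrix files by eye, a few summary numbers can help with the comparison. The following is a minimal sketch that assumes the matrix files use the usual layout: comment lines starting with "#", one header line with the residue letters, and then one row per residue. The function names parse_matrix and summarize are hypothetical, and the embedded text is only a three-residue excerpt of BLOSUM62 used for demonstration.

```python
# Three-residue excerpt in the assumed file layout (demonstration only).
EXCERPT = """\
# Three-residue excerpt of BLOSUM62
   A  R  N
A  4 -1 -2
R -1  5  0
N -2  0  6
"""


def parse_matrix(text):
    """Return {(row_residue, column_residue): score} from matrix text."""
    scores = {}
    columns = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if columns is None:
            columns = fields  # first data line holds the column letters
            continue
        row, values = fields[0], fields[1:]
        for col, value in zip(columns, values):
            scores[(row, col)] = float(value)
    return scores


def summarize(scores):
    """Mean score for identical pairs and mean score for substitutions."""
    diagonal = [v for (r, c), v in scores.items() if r == c]
    off_diagonal = [v for (r, c), v in scores.items() if r != c]
    return sum(diagonal) / len(diagonal), sum(off_diagonal) / len(off_diagonal)


print(summarize(parse_matrix(EXCERPT)))
```

To apply it to the real matrices, read the text of a file from /net/emboss/share/EMBOSS/data/ (for example EPAM10) and pass it to parse_matrix instead of EXCERPT.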

Exercise 2

Make a local protein alignment of the HER2 protein Spongilla_tyrosine_kinase.prt against all the proteins found in the E. coli genome. A database of these proteins has already been downloaded, so you can type:

/net/emboss/bin/water Spongilla_tyrosine_kinase.prt /net/data/Ecoli.prot.fasta
You can get all the scores from the file with this perl command:
perl -ne 'print "$1\n" if (/^# Score:\s*([0-9]+)/);' <output file>
You can pipe this into sort, for instance (" | sort -nr"), to find the largest scores. Make a histogram of the scores in R, Excel or whatever you prefer.
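
If you prefer to stay in a script, the extraction and the histogram can also be sketched in Python. This is a rough equivalent of the perl command above, not a replacement for it: it additionally keeps any decimal part of the score, the bin width of 5 is an arbitrary choice, and the function names and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as the perl one-liner, extended to keep a decimal part if present.
SCORE_LINE = re.compile(r"^# Score:\s*([0-9]+(?:\.[0-9]+)?)")


def read_scores(lines):
    """Collect the scores from the '# Score:' lines of a water output."""
    scores = []
    for line in lines:
        found = SCORE_LINE.match(line)
        if found:
            scores.append(float(found.group(1)))
    return scores


def histogram(scores, bin_width=5):
    """Count scores per bin; keys are the lower edges of the bins."""
    counts = Counter(int(s // bin_width) * bin_width for s in scores)
    return dict(sorted(counts.items()))


def print_histogram(hist, bin_width=5):
    """Print one line of stars per bin."""
    for start, count in hist.items():
        print(f"[{start:>4}, {start + bin_width:>4}) {'*' * count}")


# Made-up sample lines; in practice pass an open output file to read_scores.
sample = ["# Length: 30", "# Score: 25.5", "# Score: 31.0", "# Score: 33.0"]
print_histogram(histogram(read_scores(sample)))
```

For a real run, open the output file written by water and pass the file object to read_scores, since iterating over a file yields its lines.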

Look at the highest-scoring matches. You can read about the matching proteins in the Swiss-Prot database. Try to interpret the matches.

Optional: Repeat all this with the protein Yeast_DIE2.prt.


Made by Svend Erik Westh Hansen, autumn 2002
Modified by AK, autumn 2003
Few changes by AK, autumn 2004 and Nov. 2005