Exercise, Bioinformatics 1

Pairwise sequence comparison with dot plots

Dotter is a graphical dot plot program for detailed comparison of two sequences. It generates a dot plot in which the first sequence is laid out along the x-axis and the second sequence along the y-axis. Wherever a region of the first sequence is similar to a region of the second, a dot is plotted at the corresponding coordinates with an intensity proportional to the similarity.
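The idea behind a dot plot can be sketched in a few lines of Python. This is a toy illustration only, not Dotter's actual algorithm: it simply counts identical letters in a sliding window, whereas Dotter scores similarity and draws graded intensities. The function names and the window and min_matches parameters are made up for this example.

```python
def dot_matrix(seq_x, seq_y, window=3, min_matches=3):
    """Return the set of (x, y) start positions where a window of seq_x
    and a window of seq_y share at least min_matches identical letters."""
    dots = set()
    for x in range(len(seq_x) - window + 1):
        for y in range(len(seq_y) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_x[x:x + window], seq_y[y:y + window])
            )
            if matches >= min_matches:
                dots.add((x, y))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the dots as text: seq_x runs left to right, seq_y top to bottom."""
    rows = []
    for y in range(len(seq_y)):
        rows.append(
            "".join("*" if (x, y) in dots else "." for x in range(len(seq_x)))
        )
    return "\n".join(rows)


if __name__ == "__main__":
    a = "ACGTACGT"
    b = "ACGT"
    print(render(a, b, dot_matrix(a, b, window=4, min_matches=4)))
```

A repeat in the first sequence shows up as more than one dot on the same row, which is the pattern to look for in the exercises below.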

Dotter is useful for tasks such as:
identifying sequence repeats in DNA and proteins.
identifying functional domains.
identifying coding regions in genomic DNA.

To run Dotter, type "dotter" followed by two file names:
dotter [file1] [file2] &
and press Enter (the "&" is optional; it puts the job in the background). Three windows open: the main dot plot, a sequence window showing the aligned sequences around the crosshair, and a graymap tool for adjusting the display of dots.

If you right-click in the dot plot window you get the main menu.

The sequence files must be in FASTA (pronounced "fast A") format. The two files may both be protein sequences, both be DNA sequences, or be one DNA sequence and one protein sequence (in which case the protein must be the second file). Below there are links to some FASTA files that you can download (shift + left mouse button).
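A FASTA file is plain text: each record starts with a header line beginning with ">", followed by one or more lines of sequence. If you want to check a downloaded file before running Dotter, a minimal reader could look like the sketch below. The function name parse_fasta and the example records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical example records, not taken from the exercise files.
    example = ">seq1 demo\nACGT\nTTGA\n>seq2\nMKV\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```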


Exercise 1: Dot plot of amino acid sequences: elucidation of functional domains in protein kinases.

The integral membrane proteins in the HER2 family are members of the tyrosine kinase family and are important diagnostic tools and targets for antibody-based therapy of breast cancer. If you want to know more about protein kinases, protein phosphorylation and cellular regulation, read the excellent review given by Edmond H. Fischer in his Nobel lecture, part I and part II.

A. Compare HER2/ErbB2 against itself and against HER3/ErbB3. Note the patterns in the dot plot and try to find functional domains, for example cysteine-rich regions. Vary the display with the graymap tool to find the most similar regions, and move the cross with the mouse or the arrow keys to see the matching sequences. (You can find a similar pattern when comparing ErbB2 against HER1/ErbB1 and HER4/ErbB4.) (You can compare with the domain structure found here to see what the domains are called and a description of their function.)

B. Now compare HER2/ErbB2 with ErbB2-dog. Do you see the same pattern? There is an insertion in the C-terminal region of one of the sequences. Which sequence has the insertion, and what is it? Use the Dotter sequence window to view the sequences. You can make a magnified view by dragging out a region with the middle mouse button (this opens a new dot plot).


Exercise 2: Dot plot of DNA: locating exon/intron boundaries (splice sites).

Gamma-aminobutyric acid type A receptors (GABAARs) are inhibitory receptor-ion channels expressed in neurons of the brain. They belong to a family of neurotransmitter-activated (ligand-activated) receptors.

Run a dot plot of the mRNA (cDNA) from the human GABAAR sequence against part of the genomic DNA sequence from human chromosome 5. The exon and intron regions should be easy to see. You can zoom in by pressing the middle mouse button and dragging.

A. Are the splice sites consensus splice sites? In vertebrates, most introns start with GT and end with AG; these are called consensus splice sites.
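Once you have read the intron boundaries off the dot plot, the GT...AG rule is easy to check by hand or with a small helper like the sketch below. The function name, the 0-based coordinate convention and the example sequence are all made up for illustration; Dotter itself reports positions counted from 1, so subtract 1 from a start position if you use this.

```python
def is_consensus_intron(genomic, intron_start, intron_end):
    """Check the GT...AG rule for an intron.

    intron_start is the 0-based index of the first intron base and
    intron_end is the index just past the last intron base.
    """
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


if __name__ == "__main__":
    # Hypothetical toy sequence: exon, intron, exon.
    exon1, intron, exon2 = "ATGGCC", "GTAAGTCCCTTTCAG", "GGATAA"
    genomic = exon1 + intron + exon2
    print(is_consensus_intron(genomic, len(exon1), len(exon1) + len(intron)))
```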

B. Most of the cDNA matches the genomic DNA exactly. What is going on at the 3' end of the cDNA?

C. Make a similar dot plot with the genomic DNA against the GABAAR protein. What do you see in the sequence window? Is it easy to spot exact splice sites this way? The cDNA matched the genomic DNA further downstream. Why is the protein shorter? (You could make a dot plot of the cDNA against the protein.)
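To compare DNA with a protein, the DNA has to be read as codons, and each of the three reading frames gives a different amino acid sequence. The sketch below translates a DNA string in a chosen frame using the standard genetic code. It is meant to help you interpret the sequence window and is not a description of how Dotter works internally; the names translate, BASES and CODON_TABLE are made up for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate DNA in reading frame 0, 1 or 2; '*' marks a stop codon,
    'X' marks a codon containing a letter other than A, C, G or T."""
    dna = dna.upper()
    protein = []
    for pos in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    for frame in range(3):
        print(frame, translate("ATGGCCTAA", frame))
```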


Exercise 3: Inverted repeats and low complexity regions.

A. Run a dot plot of genomic DNA from human chromosome 5 against itself (save it for exercise 3C with option -b, see "NOTES" below).
Find inverted repeats. They appear as lines perpendicular to the main diagonal.
Also try to find low-complexity regions (they show up as larger "black boxes" in the dot plot), for example long stretches of CTTTCTTT...
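An inverted repeat is a stretch of DNA whose reverse complement occurs elsewhere in the same sequence, which is why it runs against the direction of the main diagonal. The sketch below shows the idea on a toy scale by looking for exact k-letter matches only. The function names and the example sequence are made up for illustration, and real repeats are rarely exact, so use the dot plot for the actual exercise.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(dna, k=6):
    """Return (i, j) pairs with i < j where the k-mer starting at i equals
    the reverse complement of the k-mer starting at j (0-based positions)."""
    positions = {}
    for i in range(len(dna) - k + 1):
        positions.setdefault(dna[i:i + k], []).append(i)
    hits = []
    for j in range(len(dna) - k + 1):
        target = reverse_complement(dna[j:j + k])
        for i in positions.get(target, []):
            if i < j:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence: ATGCAC, a spacer, then its reverse complement.
    print(find_inverted_repeats("ATGCACTTTTGTGCAT", k=6))
```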

B. (optional) Find the human GABAAR alpha1 homolog in the mouse genomic DNA by searching GenBank at NCBI. For example, use the keyword GABRA1. The gene is on chromosome 11. Extract a sequence containing the mouse gene in FASTA format.
Run a dot plot of human genomic DNA against mouse genomic DNA.
Do you get useful information from this plot?

C. (optional) Can you characterize the region around the DNA repeat found near position 21600, 21600 in the dot plot of genomic DNA from human chromosome 5 against itself? For example, is it Alu-like, a pseudogene, or transposon-like? (We do not know the answer!)
Examples of DNA repeats can be found "here".

NOTES

By default, dotter runs but does not create an output file. To store the output data in a new file, run the command:
> dotter -b[result.file] [file1] [file2]

To view a stored result, type the following at the command line:
> dotter -l[result.file] [file1.dna] [file2.dna]



Made by Svend Erik Westh Hansen, autumn 2002
Modified by AK, autumn 2003