Exercise, Bioinformatics 1
Pairwise sequence comparison with dot plots
Dotter is a graphical dotplot program for detailed comparison of two
sequences. It generates a dotplot in which the first sequence is laid
out along the x-axis and the second sequence along the y-axis. Wherever
a region of one sequence is similar to a region of the other, a dot is
plotted with an intensity proportional to the similarity.
Dotter is very useful for things like:
identifying sequence repeats inside DNA and proteins.
identifying functional domains.
identifying coding regions in genomic DNA.
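The dot plot idea described above can be sketched in a few lines of
Python. This is a simplified illustration, not Dotter's actual
algorithm: it only counts identical letters in a short sliding window,
and the names dot_matrix, render, window and min_matches are
hypothetical, chosen for this example.

```python
def dot_matrix(seq_x, seq_y, window=3, min_matches=3):
    """Score every window pair; grid[j][i] compares seq_x at i with seq_y at j."""
    cols = len(seq_x) - window + 1
    rows = len(seq_y) - window + 1
    grid = []
    for j in range(rows):
        row = []
        for i in range(cols):
            # Count identical letters inside the two windows.
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            # Keep the score only if it passes the threshold (the "dot").
            row.append(matches if matches >= min_matches else 0)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text: '#' where a dot is plotted, '.' elsewhere."""
    return "\n".join("".join("#" if score else "." for score in row) for row in grid)


# A sequence compared against itself always gives the main diagonal.
print(render(dot_matrix("GATTACA", "GATTACA")))
```

Lowering min_matches below window lets weaker similarities through,
which is roughly the effect of adjusting a display threshold by hand.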
To run dotter type "dotter" followed by two file names:
dotter [file1] [file2] &
and press Enter (the "&" is optional; it puts the job in the
background). You get three windows: the main dot plot, a sequence
window showing the aligned sequences around the point where the two
lines cross, and a graymap tool where you can adjust the display of
dots.
If you right-click in the dot plot window you get the main menu.
The sequence files must be in FASTA (pronounced "fast A") format and
may be both protein sequences, both DNA sequences, or one DNA sequence
and one protein sequence (in which case the protein must be the
second). Below there are links to some FASTA files that you can
download (shift + left mouse button).
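As a reminder of what a FASTA file looks like, here is a minimal reader
sketch. Each record is a header line starting with ">" followed by one
or more lines of sequence. The functions read_fasta and guess_type and
the two example records are invented for illustration; the 90%
threshold in guess_type is an arbitrary assumption, not a rule used by
Dotter.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def guess_type(sequence):
    """Crude guess: 'DNA' if at least 90% of the letters are A, C, G, T or N."""
    letters = sequence.upper()
    dna_like = sum(letter in "ACGTN" for letter in letters)
    return "DNA" if letters and dna_like / len(letters) >= 0.9 else "protein"


# Invented example records, not real database entries.
example = """>toy_dna an invented DNA record
ACGTACGTAC
GTTAGC
>toy_protein an invented protein record
MELAALCRWG
"""

for header, sequence in read_fasta(example.splitlines()):
    print(header.split()[0], guess_type(sequence), len(sequence))
```

A check like guess_type can help you put the files in the right order
when one sequence is DNA and the other is protein.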
Exercise 1: Dotplot of amino acid sequences: elucidation of
protein functional domains inside protein kinases.
The integral membrane proteins in the HER2 family are members of the
tyrosine kinase family and are important diagnostic tools and targets
for antibody-based therapy of breast cancer. If you want to know more
about protein kinases, protein phosphorylation and cellular regulation,
read the excellent review given by Edmond H. Fischer in his Nobel
lecture, part I and part II.
A. Compare HER2/ErbB2 against itself and against HER3/ErbB3. Notice the
patterns in the dot plot and try to find functional domains, for
example cysteine-rich regions. Try to vary the display with the graymap
tool to find the most similar regions, and move the cross with the
mouse or the arrow keys to see the matching sequences. (You can find a
similar pattern when comparing ErbB2 against HER1/ErbB1 and
HER4/ErbB4.) (You can compare with the domain structure found here to
see what the domains are called and a description of their function.)
B. Now compare HER2/ErbB2 with ErbB2-dog. Do you see the same pattern?
There is an insertion in the C-terminal region of one of the sequences.
Which sequence has the insertion and what is it? Use the dotter
sequence window to view the sequences. You can make a magnified view by
dragging out a region with the middle mouse button (this makes a new
dot plot).
Exercise 2: Dotplot of DNA: Locating exon/intron boundaries
(splice sites).
Gamma-aminobutyric acid type A receptors (GABAARs) are inhibitory
receptor-ion channels expressed in interneurons of the brain and belong
to a family of neurotransmitter- or ligand-activated receptors.
Run a dot plot of the mRNA (cDNA) from the human GABAAR sequence
against part of the genomic DNA sequence from human chromosome 5.
Intron-exon regions should be easy to see. You can zoom in by pressing
the middle mouse button and dragging.
A. Are the splice sites consensus splice sites? In vertebrates, an
intron typically starts with GT and ends with AG; these are called the
consensus splice sites.
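Once you have read intron coordinates off the dot plot, the GT...AG
rule can be checked mechanically. The sketch below assumes 0-based,
end-exclusive coordinates (Python slicing), which may differ from the
numbering shown in the sequence window. The helper names and the toy
sequence are invented for illustration.

```python
def is_consensus_intron(genomic, start, end):
    """True if genomic[start:end] begins with GT and ends with AG."""
    intron = genomic[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice(genomic, introns):
    """Remove the given (start, end) introns and return the joined exons."""
    exons, previous_end = [], 0
    for start, end in sorted(introns):
        exons.append(genomic[previous_end:start])
        previous_end = end
    exons.append(genomic[previous_end:])
    return "".join(exons)


# Invented toy sequence laid out as exon, intron, exon.
toy_genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
toy_introns = [(6, 18)]

for start, end in toy_introns:
    print(start, end, is_consensus_intron(toy_genomic, start, end))
print(splice(toy_genomic, toy_introns))
```

If a boundary you read off the plot fails the check, try shifting it by
a base or two before concluding that the site is non-consensus, since
the exact boundary can be ambiguous when the exon and intron share
bases at the junction.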
B. Most of the cDNA matches the genomic DNA exactly. What's going on at
the 3' end of the cDNA?
C. Do a similar dotplot with the genomic DNA vs the GABAAR protein.
What is it you are seeing in the sequence window? Is it easy to spot
exact splice sites this way? The cDNA matched the genomic DNA further
downstream. Why is the protein shorter? (You could do a dotplot of the
cDNA vs the protein.)
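To compare DNA with a protein, the DNA has to be read as codons, and
the reading frame is not known in advance. The sketch below translates
a DNA string in the three forward frames using the standard genetic
code. It is an illustration of the idea under that assumption, not a
description of Dotter's internals; the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate dna from offset frame (0, 1 or 2); unknown codons become X."""
    dna = dna.upper()
    protein = []
    for position in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[position:position + 3], "X"))
    return "".join(protein)


def three_frames(dna):
    """Return the translations of dna in forward frames 0, 1 and 2."""
    return [translate(dna, frame) for frame in range(3)]


print(three_frames("ATGGCCTAA"))  # ['MA*', 'WP', 'GL']
```

Within one exon only one of the three frames gives the real protein,
and the frame can change from one exon to the next because intron
lengths are not generally multiples of three.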
Exercise 3: Inverted repeats and low complexity regions.
A. Run a dot plot of genomic DNA from human chromosome 5 against itself
(save it for exercise 3C with option -b, see "NOTES" below). Find
inverted repeats. They are characterized by lines perpendicular to the
diagonal. Try also to find low complexity regions (these show up as
larger "black boxes" in the dot plot), for example long stretches of
CTTTCTTT......
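Both features can also be looked for programmatically. An inverted
repeat is a segment whose reverse complement occurs elsewhere in the
same sequence; because the match runs forward on one axis and backward
on the other, it appears as a line perpendicular to the main diagonal.
Low complexity can be approximated by how few distinct short words a
region contains. The sketch below uses exact k-mer matching, which is
much stricter than a dot plot; all names and the toy sequences are
invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(dna, k):
    """Return (i, j) pairs, i < j, where the k-mer at j is the reverse
    complement of the k-mer at i."""
    positions = {}
    for start in range(len(dna) - k + 1):
        positions.setdefault(dna[start:start + k], []).append(start)
    pairs = []
    for i in range(len(dna) - k + 1):
        partner = reverse_complement(dna[i:i + k])
        for j in positions.get(partner, []):
            if i < j:
                pairs.append((i, j))
    return pairs


def distinct_kmer_fraction(dna, k=4):
    """Fraction of k-mers that are distinct; low values suggest low complexity."""
    kmers = [dna[start:start + k] for start in range(len(dna) - k + 1)]
    return len(set(kmers)) / len(kmers) if kmers else 0.0


# Invented toy sequence: left arm, spacer, reverse complement of the left arm.
toy = "GATTACAG" + "TTTT" + reverse_complement("GATTACAG")
print(find_inverted_repeats(toy, 8))  # [(0, 12)]
print(distinct_kmer_fraction("CTTTCTTTCTTTCTTT"))
print(distinct_kmer_fraction(toy))
```

On real genomic DNA the arms of an inverted repeat are rarely
identical, so exact matching with a large k will miss most of them;
the dot plot remains the better tool for spotting diverged copies.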
B. (optional) Find the mouse homolog of the human GABAAR alpha1 gene in
mouse genomic DNA by searching GenBank at NCBI. For example, use the
keyword GABRA1. The gene is on chromosome 11. Extract a sequence
containing the mouse gene in FASTA format. Run a dot plot of human
genomic DNA against mouse genomic DNA. Do you get useful information
from this plot?