Bioinformatics 1 exercise

BLAST

In this exercise you will learn to use the BLAST program on the web. There are several different versions of BLAST, and you will try a few of them here. BLAST is maintained by NCBI, which also provides many other web services. You can, for instance, run a variety of BLAST searches against various databases. There are also many resources related to specific genomes.

The last part lets you run blast, fasta and Smith-Waterman locally.

Exercise 1 - protein vs. protein

Here is the human protein-tyrosine kinase we used in the dotter exercise: HER3/ErbB3.

First, let us see if there is a homologous protein in the fruit fly. Go to the BLAST page and choose a protein BLAST search against the Drosophila genome database. Paste the FASTA sequence into the window and run the search.

What do you think some of the short matches to other proteins are? Is it the same part of the protein that has these short matches?

Select the best match (follow the link) and save the sequence in FASTA format. Try to find out what is known about the protein. For instance, search the Swiss-Prot database with this sequence and see if there is a match. Swiss-Prot is the best-annotated protein database.

There are sometimes X's in the query sequence in the Blast output. X means an unknown amino acid. They are not in the original sequence you submitted. Why are they there?

Exercise 2 - protein vs. DNA

Now we want to find out where the protein is encoded in the genome. Go to the BLAST page, choose protein vs. DNA (translated BLAST searches, tblastn) and select the Drosophila genome again.

The start and end coordinates of the matches are shown. Do they make sense? Try to figure out the relation between the DNA and the protein sequence shown.

Exercise 3 - DNA vs. DNA

(If a search takes too long, you should be able to use the RID numbers below to see a previously done search.)

Search with the mRNA (cDNA) sequence of the human gamma-aminobutyric acid receptor (GABAAR), alpha1 subtype, against the fly genome using standard nucleotide BLAST (RID: 1066943373-22450-1489687.BLASTQ3).

Repeat the search with tblastx (RID: 1066943711-25276-506358.BLASTQ3). Find some information on how tblastx works.
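As a starting point for reading about tblastx: translated searches compare sequences at the protein level by translating nucleotide sequences in all six reading frames (three per strand). The sketch below only illustrates six-frame translation with the standard genetic code; it is not how BLAST itself is implemented, and the function names (reverse_complement, translate, six_frames) are made up for this illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, not taken from the exercise data.
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Note that a stop codon is written as * and that only one of the six frames normally corresponds to the real protein at any given position.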

Compare the results for the highest-scoring sequences.

What is the important lesson to learn from this comparison?

Exercise 4 - Comparing blast, fasta and Smith-Waterman

In a previous exercise you used the "water" program to run a Smith-Waterman search with the HER2 protein Spongilla_tyrosine_kinase.prt against a database of E. coli proteins. Find the search result or repeat it.

Now do the same search with blast:
blastall -p blastp -d /net/data/Ecoli.prot -i HER2-fasta.prt
which is the command line syntax for blast (running blastall with no arguments will show all the options you can use).
Is it faster than water?
Are there any very significant matches?
Are the most significant matches the same as the highest-scoring ones from water?
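To answer the speed question with numbers rather than by feel, you can time each program. The sketch below is one possible way to do it in Python; time_command is a helper invented for this illustration, and it assumes blastall is on your PATH. The blastall arguments are the ones given above; the water command is left out because its options are not listed in this handout.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command, discard its output, return (seconds, exit code)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


if __name__ == "__main__":
    # Same blastall call as in the exercise text.
    blast_cmd = [
        "blastall", "-p", "blastp",
        "-d", "/net/data/Ecoli.prot",
        "-i", "HER2-fasta.prt",
    ]
    try:
        seconds, code = time_command(blast_cmd)
        print(f"blastall finished in {seconds:.2f} s (exit code {code})")
    except FileNotFoundError:
        # Raised when the program is not installed or not on PATH.
        print("blastall was not found on this machine", file=sys.stderr)
```

Run each program a couple of times before comparing, since the first run may be slowed by reading the database from disk.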

Now do the same with the fasta program:
fasta HER2-fasta.prt /net/data/Ecoli.prot.fasta
What do you think the fancy graphics show?

The fasta package also comes with an implementation of Smith-Waterman called "ssearch". Try it (same syntax as fasta).

Compare the four results. Why are there differences? Does the protein have an E. coli homolog?
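One simple way to organise the comparison is to write down the top hits from each program and look at which are shared and which are found by only one program. The sketch below assumes you have copied the hit identifiers by hand into lists, best hit first. compare_hits is a name invented for this illustration, and the identifiers in the example are placeholders, not real search results.

```python
def compare_hits(results, top_n=10):
    """Compare the top_n hits of several programs.

    results maps a program name to its list of hit IDs, best hit first.
    Returns (shared, unique): the IDs present in every program's top_n,
    and for each program the IDs that no other program reported.
    """
    top = {name: set(hits[:top_n]) for name, hits in results.items()}
    shared = set.intersection(*top.values()) if top else set()
    unique = {}
    for name, ids in top.items():
        others = set().union(*(v for n, v in top.items() if n != name))
        unique[name] = ids - others
    return shared, unique


if __name__ == "__main__":
    # Placeholder identifiers only; replace with the hits you observed.
    example = {
        "water": ["a", "b", "c"],
        "blastp": ["a", "c", "e"],
        "fasta": ["a", "b", "d"],
        "ssearch": ["a", "b", "c"],
    }
    shared, unique = compare_hits(example, top_n=3)
    print("found by all programs:", sorted(shared))
    for name, ids in unique.items():
        print(f"only in {name}:", sorted(ids))
```

Overlap in the hit lists says nothing about significance on its own, so still look at the scores and E-values reported by each program.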