SAME97 Profile Searching Practical

edited by Roderic Guigó

EMBO Practical Course on Sequence Analysis and Molecular Evolution

Profile Searching Practical

by Toby Gibson, Ewan Birney and Des Higgins, 9/7/97

In this practical we will use profile search tools available through the web. Profile searches are one of the most sensitive search tools currently available. The raw material for profile searching is a multiple sequence alignment. A profile scores the amino acids at each position in the alignment: conserved positions score more strongly than unconserved ones (whereas in a single sequence, they are all equally significant). We can look at the effect of setting up the profile with different residue substitution matrices. We can compare the sensitivity to a search with a single sequence as query.
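As an illustration of how a profile turns alignment columns into position-specific scores, here is a minimal Python sketch. It is not the ProfileWeight algorithm: the function names and the toy match/mismatch matrix (a stand-in for a real matrix such as Blosum62) are assumptions made for the example, and no sequence weighting is applied.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def toy_matrix(a, b):
    """Hypothetical stand-in for a real residue substitution matrix."""
    return 2.0 if a == b else -1.0


def build_profile(alignment, score=toy_matrix):
    """Return one {residue: score} dict per alignment column.

    Each score is the average substitution score between the candidate
    residue and the residues observed in that column (gaps are ignored).
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(n_cols):
        residues = [row[col] for row in alignment if row[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        column_scores = {}
        for a in ALPHABET:
            if total == 0:
                column_scores[a] = 0.0
            else:
                column_scores[a] = sum(
                    n * score(r, a) for r, n in counts.items()
                ) / total
        profile.append(column_scores)
    return profile


# Toy alignment: column 1 is fully conserved (G), column 3 is variable (L/I/V).
alignment = ["GKLA", "GRIA", "GKVS"]
profile = build_profile(alignment)
print(profile[0]["G"], profile[2]["L"])  # prints: 2.0 0.0
```

With this toy matrix, G in the conserved first column scores 2.0, whereas L in the variable third column scores only 0.0. A real matrix would also give partial credit to similar residues such as I and V.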

WWW DB Tools

We will use:

SRS (at EBI or EMBL) to extract a query sequence.
WWWProfileWeight to make the profiles from a sequence alignment.
Bioccelerator (at EBI or EMBL1 or EMBL2) for the Profile Searches.

Bioccelerators are installed at both EBI and EMBL, which should be useful in case of server or network problems. The servers are not identical and if you try more than one you may notice some differences.

Step 0 Build a TFIIB multiple alignment

Load SRS (at EBI or EMBL) and Start the session.

Select the Swiss-Prot database and continue.
Type TFII & beta in the Description box, then Do Query.
Click on save.
Set the sequence format to fasta and the view to FastSeqs, then click on save.
In Netscape, save to a file, e.g. tfIIb.fa

Load Clustal W at EBI.
With Upload a file, read in the TFIIB set.
Now RUN CLUSTAL W.
Use JalView for a Java interface to the alignment.
Save the .aln file as e.g. tfIIb.aln

Step 1 Preparing a profile from a TFIIB alignment

TFIIB is a core transcription factor in both eukaryotes and archaea which has been quite strongly conserved through evolution. TFIIB has a ~90 residue duplicated domain, the TFIIB repeats, with N- and C-terminal extensions. A second protein family in eukaryotes (but not found in archaea) shares the same structural topology, and presumably shares common ancestry, although the function is not conserved. Well-optimised searches with TFIIB queries should be able to find this second family, which has many divergent entries, and the number of entries that are picked up is a measure of the search sensitivity.

Load WWWProfileWeight.
Upload the alignment file tfIIb.aln
Select the Blosum62 matrix.
Give the Range of Alignment as Begin=140 and End=341 (the TFIIB repeats only).
Run ProfileWeight to make the profile.
Look at the resulting profile:
- (a) See how scores for amino acids vary for each position in the alignment.
- (b) See how the position-specific gap penalties are lowered at existing gaps.
- (c) Note the suggested gap penalties in the header: these are only a rough guide.
Save Output to save the profile to a file (e.g. tfIIb.prf) for use in the profile search.
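The Range of Alignment setting and the lowered gap penalties at existing gaps can be pictured with a short Python sketch. The function names and the rule used here (penalty factor = 1 minus the fraction of gaps in the column, with a floor) are assumptions for illustration only; ProfileWeight's actual formula may differ.

```python
def slice_alignment(alignment, begin, end):
    """Keep columns begin..end (1-based, inclusive) of every aligned row."""
    if begin < 1 or end < begin:
        raise ValueError("need 1 <= begin <= end")
    return [row[begin - 1:end] for row in alignment]


def gap_penalty_factors(alignment, floor=0.1):
    """Per-column multiplier for the gap penalty.

    Hypothetical rule: columns that already contain gaps get a lower
    factor, never going below `floor`.
    """
    n_rows = len(alignment)
    factors = []
    for col in range(len(alignment[0])):
        gap_fraction = sum(row[col] == "-" for row in alignment) / n_rows
        factors.append(max(floor, 1.0 - gap_fraction))
    return factors


# Toy alignment standing in for tfIIb.aln; the practical uses Begin=140, End=341.
alignment = ["MKT-LLG", "MKSALLG", "MRT-ILG", "MKT-LLA"]
region = slice_alignment(alignment, 2, 5)
print(region)                       # ['KT-L', 'KSAL', 'RT-I', 'KT-L']
print(gap_penalty_factors(region))  # [1.0, 1.0, 0.25, 1.0]
```

The third column of the selected region has gaps in three of four sequences, so a new gap there is penalised at only a quarter of the full rate.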

Step 2 BIC_Profilesearch with a TFIIB profile prepared with the Blosum62 matrix

The Bioccelerator is fast dedicated hardware exclusively designed for dynamic programming (i.e. slow but sensitive) sequence comparison. It is built by the company Compugen. It can perform a number of search permutations including basic Smith-Waterman, profile searches and Protein v. DNA frame-shifting comparisons. Today we will do the Profile Search, which finds the best matching segments between a query profile (derived from a multiple alignment) and a database sequence, allowing for gaps to be inserted at any position.
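For readers who want to see the dynamic programming idea in software, the following Python sketch scores the best local match between a profile and one sequence using a Smith-Waterman recursion with affine gaps. It is a simplified illustration, not the Bioccelerator's implementation: the function name, the toy profile, the default mismatch score, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are all assumptions.

```python
def profile_local_score(profile, sequence,
                        gap_open=1.0, gap_extend=0.1, default=-1.0):
    """Best local alignment score of a profile against a sequence.

    `profile` is a list of {residue: score} dicts, one per position.
    Residues missing from a position's dict score `default`.
    """
    n, m = len(profile), len(sequence)
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a gap in the profile
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a gap in the sequence
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            match = H[i - 1][j - 1] + profile[i - 1].get(sequence[j - 1], default)
            # Local alignment: a score never drops below zero.
            H[i][j] = max(0.0, match, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Toy four-position profile preferring G, K, L, A in turn.
toy_profile = [{"G": 2.0}, {"K": 2.0}, {"L": 2.0}, {"A": 2.0}]
print(profile_local_score(toy_profile, "TTGKLATT"))   # 8.0 (ungapped match)
print(profile_local_score(toy_profile, "TTGKWLATT"))  # 7.0 (one gap opened)
```

The software version fills a table with one cell per profile position and sequence residue, which is why such searches are slow without dedicated hardware.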

Load a Bioccelerator home page (at EBI or EMBL1 or EMBL2).
Go to the Searches Page.
Select Profilesearch.
Select Profilesearch in the Application box.
Upload the TF2B profile (e.g. tfIIb.prf).
Give Gap opening penalty 1.0 and extension penalty 0.1.
Select 50 alignments.
Select the Swiss-Prot database.
Do Search.
Save the output to a new file, so that you do not lose it.

The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output.

Questions

1. How are TFIIB entries distributed in the output?
2. Are other classes of proteins consistently detected?
3. How many of these are present in the top 50 hits?
4. What is the top entry that does not belong to these families?

Step 3 BIC_Profilesearch with the TFIIB profile prepared with the Gonnet Pam250 matrix

Now repeat the search but use a profile made with a softer matrix, i.e. a matrix that weights similar residue exchanges more highly.

First make a new profile:

Reload the WWWProfileWeight query page.
Give the file tfIIb.aln in the Upload box.
Select the Gonnet250 matrix.
Give the Range of Alignment as Begin=140 and End=341 (the TFIIB repeats only).
Run ProfileWeight to make the profile.
Save Output to save the profile to a file (e.g. tfIIb_250.prf) for use in the profilesearch.

Now run another search:

Open a new Netscape web browser and load this page into it.
Load a Bioccelerator home page (at EBI or EMBL1 or EMBL2).
Go to the Searches Page.
Select Profilesearch in the Application box.
Upload the new TF2B profile (tfIIb_250.prf).
Give Gap opening penalty 1.0 and extension penalty 0.1.
Select 50 alignments.
Select the Swiss-Prot database.
Run Search.
Save output to a new file, so that you do not lose it.

Now you can compare the results of the Gonnet Pam250 and Blosum62 matrices.

Questions

1. How many of the other family of proteins are detected in the top 50 hits?
2. Are the output alignments longer or shorter than before?
3. Which matrix provided the most sensitive search?
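If you copy the entry names from the two high-score lists into text form, a small Python helper can make the comparison less error-prone. This is only a sketch: the function name is hypothetical, and the entry names below are placeholders, not results from the searches.

```python
def compare_hit_lists(hits_a, hits_b, top_n=50):
    """Compare the top_n entries of two ranked hit lists.

    Returns the entries found by both searches and those unique to each,
    preserving the rank order of the list they came from.
    """
    top_a = hits_a[:top_n]
    top_b = hits_b[:top_n]
    set_a, set_b = set(top_a), set(top_b)
    return {
        "shared": [h for h in top_a if h in set_b],
        "only_a": [h for h in top_a if h not in set_b],
        "only_b": [h for h in top_b if h not in set_a],
    }


# Placeholder entry names (NOT real search output).
blosum_hits = ["FAM1_X", "FAM1_Y", "FAM2_X"]
gonnet_hits = ["FAM1_X", "FAM2_X", "FAM2_Y", "FAM1_Y"]
result = compare_hit_lists(blosum_hits, gonnet_hits)
print(result["shared"])  # ['FAM1_X', 'FAM1_Y', 'FAM2_X']
print(result["only_b"])  # ['FAM2_Y']
```

Entries that appear in only one list are the ones to inspect first when judging which matrix was more sensitive.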

Step 4 Bic_SW search with the yeast TFIIB sequence

Now set up a search with TF2B_Yeast, in order to determine whether profile searches are really more sensitive than a single sequence as query.

If needed, open a new Netscape web browser and load this page into it.
Load SRS (at EBI or EMBL) and Start the session.
Select the Swiss-Prot database and continue.
Type TFIIB in the Description box, then Do Query.
Click on the entry TF2B_Yeast.
Now load a Bioccelerator home page (at EBI or EMBL1 or EMBL2).
Go to the Searches Page.
Select Cut and Paste for the query.
Paste in residues 125-324 of TF2B_Yeast from the tfIIb.fa file.
Setup: the Swiss-Prot database, 50 Alignments, Gonnet Pam250 matrix.
Run Search: Bic_SW will start by default.
Save output to a new file, so that you do not lose it.
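Cutting out residues 125-324 by hand is easy to get wrong by one position. The Python sketch below shows one way to do it, assuming plain fasta text such as the contents of tfIIb.fa. The function names and the toy records are made up for the example.

```python
def read_fasta(text):
    """Parse fasta-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def subsequence(seq, start, end):
    """Return residues start..end, counted from 1 and inclusive at both ends."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("range must lie within the sequence")
    return seq[start - 1:end]


# Toy records standing in for the real file contents.
toy_text = """>toy_entry_1
MKTAYIAKQR
QISFVKSHFS
>toy_entry_2
GGGG
"""
records = read_fasta(toy_text)
print(subsequence(records["toy_entry_1"], 3, 12))  # TAYIAKQRQI
```

For the practical, the same call with start=125 and end=324 on the TF2B_Yeast record yields a 200-residue query, which can then be pasted into the Cut and Paste box.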

Now you can compare the results of the single sequence with the profile query.

Questions

1. How many of the other family of proteins are detected?
2. How did this search perform compared to the profile searches?

Take Home Lessons

Optimisation of the search setup is vital: again, in practice this means running test searches. Choosing a good residue substitution matrix is important. Optimisation of gap penalties is also critical (we did not look at this today).

The TF2B profile is actually only slightly more sensitive than the most optimised query with a single TFIIB sequence. (We were a bit wicked and chose a poor starting query: TF2B_Human does much better, as yeast has diverged more from the common ancestral sequence.) However, by adding in the BRF sequences (entries TF3B*) and then the best alignable cyclins, we would bring in more and more divergent cyclins. The RB sequences are also genuine hits, but have only a single domain.

Reciprocal searches with profiles based on the cyclin box and on the conserved motif in the RB family would need to be undertaken: in each case they would support the idea that these families are related. How this was done in practice, and some tips on setting up and evaluating profile searches, are given in the references below.

References

Gibson, T. J., Thompson, J. D., Blocker, A. and Kouzarides, T. (1994) Evidence for a protein domain superfamily shared by the cyclins, TFIIB and RB/p107. Nucleic Acids Res., 22, 946-952.
Bork, P. and Gibson, T. J. (1996) Applying motif and profile searches. Methods Enzymol., 266, 162-184.