Practical 18: Sequence similarity searches (II)
Bioinformatics 2004/2005
PSI-BLAST
PSI-BLAST (Position-Specific Iterated BLAST) is a tool that
builds a position-specific scoring matrix from a multiple
alignment of the top-scoring BLAST hits to a given query
sequence. This scoring matrix acts as a profile that captures
the key positions of conserved amino acids within a motif. When a
profile is used to search a database, it can often detect subtle
relationships between proteins that are distant structural or
functional homologues. These relationships are often not detected by a
BLAST search with a single sequence as the query.
For a simple example of what a consensus sequence, or profile, looks like, consider
the EF-hand binding loop of the calmodulin family.
A multiple alignment of this region for three calmodulin proteins is as follows:
POSITION #      1     3     5  6     8          12
CALM_HUM   ...  D  K  D  G  D  G  T  I  T  T  K  E
CALM_SCHPO ...  D  R  D  Q  D  G  N  I  T  S  N  E
CALM_YEAST ...  D  K  D  N  N  G  S  I  S  S  S  E
The profile of this region can be represented as follows:
Loop Position #    1       3       5   6       8              12
Profile            D   x   D   x D/N   G   x   I   x   x   x   E
Here x marks a position where the amino acid type varies; such a
position is therefore not heavily weighted in the alignment.
Exercise
- We have the protein sequences for CALM_HUM, CALM_SCHPO and CALM_YEAST in a multi-fasta file.
- Copy and paste the sequences into a multi-fasta file.
- Go to CLUSTALW at the European
Bioinformatics Institute (EBI) and obtain a multiple sequence alignment.
- Click on "Display colours" on the result page from ClustalW.
- Verify that the loop is aligned as given above.
The rules for deriving this simple profile are:
- any position with 90% amino acid identity or greater is considered conserved in the
profile, and thus a higher score would be given when the conserved
amino acid is found at that position in the sequence.
- any position that always contains one of only two types of amino acids
would be up-weighted to give a higher score whenever either of those
two amino acids appears at that position.
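The two rules can be tried out on the loop alignment with a few lines of Python. This is only a sketch: the function name `simple_profile` and the `identity_cutoff` parameter are illustrative choices, and the input is just the twelve loop columns copied from the alignment above. Applied literally to only three sequences, the second rule also reports positions 2, 9 and 10 as two-residue positions (K/R, S/T, S/T), whereas the profile shown earlier leaves them as x; the sketch does not try to reproduce that choice.

```python
from collections import Counter

# The twelve loop columns, copied from the three-sequence alignment above.
LOOP = {
    "CALM_HUM":   "DKDGDGTITTKE",
    "CALM_SCHPO": "DRDQDGNITSNE",
    "CALM_YEAST": "DKDNNGSISSSE",
}


def simple_profile(seqs, identity_cutoff=0.9):
    """Return one profile symbol per column of an ungapped alignment."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        residue, n = counts.most_common(1)[0]
        if n / len(column) >= identity_cutoff:
            # Rule 1: at least 90% identity, so the position is conserved.
            profile.append(residue)
        elif len(counts) == 2:
            # Rule 2: only two residue types ever occur at this position.
            profile.append("/".join(sorted(counts)))
        else:
            # Anything else is treated as a variable position.
            profile.append("x")
    return profile


profile = simple_profile(list(LOOP.values()))
for position, symbol in enumerate(profile, start=1):
    print(position, symbol)
```

With a larger family alignment the counts per column change, so the two-residue positions found here should not be read as a property of calmodulins in general.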
PSI-BLAST employs more sophisticated rules than this example to create
a profile. The profile guides the sequence similarity search and
increases its sensitivity. The first round of PSI-BLAST is just like
a normal BLAST search: it finds sequences similar to the query. In the
second round, or "iteration", PSI-BLAST builds a multiple alignment of
the significant hits and derives from it a position-specific scoring
matrix (the profile), which records which positions evolution has
conserved and which positions it has allowed to vary. The database is
then searched again, using this matrix in place of the query sequence
and a general substitution matrix. The new sequences found in each
round are added to the profile, allowing PSI-BLAST to detect more
distant homologues with each iteration. One of the known weaknesses of
PSI-BLAST is that its ability to detect distant relationships between
proteins is critically dependent on the choice of the query
sequence. For this reason, a recommended strategy with PSI-BLAST is to
query using individual functional domains. PSI-BLAST will then find
other proteins that share this domain, even if they do not possess
overall homology.
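To make the idea of a position-specific scoring matrix concrete, here is a deliberately simplified sketch built from the three EF-hand loop sequences used earlier. It is not how PSI-BLAST computes its matrix: the uniform background frequency, the flat pseudocount of 1 and the names `build_pssm` and `score` are all assumptions made for illustration, and the real program additionally weights sequences and uses substitution-matrix-based pseudocounts.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Ungapped loop columns from the calmodulin alignment shown earlier.
LOOP = ["DKDGDGTITTKE", "DRDQDGNITSNE", "DKDNNGSISSSE"]


def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score for every residue at every column.

    Assumes an ungapped alignment and a uniform background
    frequency (1/20 for each amino acid).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment of matching length."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the profile length")
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(LOOP)
# A loop-like segment scores higher than an unrelated one.
print(score(pssm, "DKDGDGTITTKE"), score(pssm, "AAAAAAAAAAAA"))
```

The point of the sketch is that the same residue gets a different score at different positions: D is rewarded strongly at the conserved position 1 and only weakly, or not at all, at a variable position.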

There are three common categories of homology that are studied in
relation to biological molecules: sequence homology, structural
homology, and functional homology. Sequence homology is the easiest to
identify, and is therefore the primary target of many bioinformatics
methods. Sequence homology yields direct implications about the
relatedness of proteins and their potential pathways of
derivation. However, to help understand how a protein is implicated in
a certain disease state, or how to design a pharmaceutical that
interacts with a given protein, functional and/or structural
information is necessary. Functional homologues are relatively easy to
define, as they are any two proteins, or protein domains, that perform
similar functions. Structural homologues contain similar "folds",
which are localized regions of a molecule that comprise a structural
feature such as a "beta barrel" or "four helical bundle" motif. The
fold can encompass the entire protein, or just one domain of the
protein. When considering sequence, functional, or structural
homology, it is important to understand that one type of homology
between proteins does not always imply another type of
homology. Nevertheless, it is a reasonable assumption that proteins
that are related through evolutionary pathways are likely to have some
degree of all three types of homology. PSI-BLAST was engineered to
identify distant relationships between sequences that are too subtle
to discover with a regular BLAST search.
For more information, see the BLAST/PSI-BLAST Tutorial at NCBI.
Case study: the Kua-UEV protein
1. Identification of homologues of Kua-UEV using PSI-BLAST
- Go to NCBI www.ncbi.nlm.nih.gov and get the entry of the protein
with identifier NP_954673 (ubiquitin-conjugating enzyme E2 Kua-UEV isoform 1) (remember to select Protein in the left menu).
Select Display in Fasta Format. Keep this sequence in a text file.
- Go to the NCBI web page www.ncbi.nlm.nih.gov and select
BLAST in the menu at the top of the page. Identify the different BLAST options.
Select PSI-BLAST from the menu.
- This works like a normal BLAST search. Paste the query sequence
in the search box and click on the BLAST button.
- Click on FORMAT to display the PSI-BLAST results (first round).
- What do the results tell us about the structure of this
protein? Can you find any hit that covers the whole length of the
protein?
- Go to the results page and run PSI-BLAST iteration 2.
- How many new hits do we have? How can we identify them?
- Let's run PSI-BLAST iteration 3. We could run more iterations until
no new hits are found (this happens at iteration 7).
- Identify in the output the proteins with GenBank accession IDs AAF08702
(Kua) and NP_071887 (E2 variant 1 isoform c). Retrieve the sequences in
Fasta format and keep them in a text file.
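The sequences can also be retrieved without the browser through the NCBI E-utilities `efetch` service. The sketch below only uses the Python standard library; the function names `efetch_fasta_url`, `fetch_fasta` and `save_fasta` and the output file name are illustrative, and the fetch itself needs network access, so it is shown as a commented-out call rather than run.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions):
    """Build an efetch URL that returns protein records as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    # safe="," keeps the comma-separated accession list readable.
    return EFETCH + "?" + urlencode(params, safe=",")


def fetch_fasta(accessions, timeout=30):
    """Download the records (requires network access)."""
    with urlopen(efetch_fasta_url(accessions), timeout=timeout) as response:
        return response.read().decode("utf-8")


def save_fasta(accessions, path):
    """Write the downloaded records to a multi-FASTA text file."""
    with open(path, "w") as handle:
        handle.write(fetch_fasta(accessions))


url = efetch_fasta_url(["AAF08702", "NP_071887"])
print(url)
# save_fasta(["NP_954673", "AAF08702", "NP_071887"], "kua_uev.fasta")
```

Opening the printed URL in a browser should show the same FASTA records as the "Display in Fasta Format" option on the web page.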
2. Alignment of Kua-UEV, Kua, UEV and protein gi:57104280.
We have seen that the matches to the query protein distribute in
two regions. We take the proteins Kua (AAF08702) and UEV variant 1 (NP_071887)
and align them to the query:
3. Gene organisation of Kua and UEV in Homo sapiens
- Read the first one of the abstracts below.
- In order to understand the transcription of this gene,
go to the Ensembl web site www.ensembl.org and select as species "Human".
- Enter UEV in the box
- Click on Ensembl Gene ENSG00000124208 and examine the different transcripts.
Kua-UEV isoform 1 from Homo sapiens is the result of a transcript
produced from two adjacent genes, Kua and UEV isoform 1.
- You can view the same organisation in the UCSC Genome Browser.
4. Identification of homologues of Kua
and UEV separately
- Kua-UEV isoform 1 from Homo sapiens is the result of a transcript
produced from two adjacent genes, Kua and UEV isoform 1.
- Perform sequence similarity searches separately with AAF08702 (Kua) and NP_071887 (UEV variant 1)
using only the first round of PSI-BLAST.
- For each one, go to "Taxonomy Reports" (top of the page).
- Do you see any differences between the taxonomic distribution of these two
proteins?
- Identify in the output of the search with Kua a protein called CarF
from the bacterium Myxococcus xanthus:
gi|13397952|emb|CAC34626.1| hypothetical protein [Myxococcus xanthus]
gi|27804817|gb|AAO22861.1| CarF [Myxococcus xanthus]
Length = 281
Score = 180 bits (456), Expect = 2e-44
Identities = 107/239 (44%), Positives = 134/239 (56%), Gaps = 22/239 (9%)
Query: 31 ARELAALYSPGKRLQEWCSVILCFSLIAHNLVHLLLLARWEDTPLVILGVVAGALIADFL 90
A+ LA YSP R E + I+ F + LV+ L + T L++ V+ G L ADF+
Sbjct: 15 AQVLAQGYSPAIRAME-IAAIVSFVSLEVALVYRLWGTPYAGTWLLLSAVLLGYLAADFV 73
Query: 91 SGLVHWGADTWGSVELPI---AFIRPFREHHIDPTAITRHDFIETNGDNCLVTLLPLLNM 147
SG VHW DTWGS E+P+ A IRPFREHH+D AITRHDF+ETNG+NCL++L P+ +
Sbjct: 74 SGFVHWMGDTWGSTEMPVLGKALIRPFREHHVDEKAITRHDFVETNGNNCLISL-PVAII 132
Query: 148 AYKFRTHSPEALEQLYPWECFVFCLIIFGTF------TNQIHKWSHTYFGLPRWVTLLQD 201
A P +VFC G TNQ HKWSH P V LQ
Sbjct: 133 ALCLPMSGPG----------WVFCASFLGAMIFWVMATNQFHKWSHMD-SPPALVGFLQR 181
Query: 202 WHVILPRKHHRIHHVSPHETYFCITTGWLNYPLEKIGFWRRLEDLIQGLTGEKPRADDM 260
H+ILP HHRIHH P+ Y+CIT GW+N PL + F+ E LI TG PR DD+
Sbjct: 182 VHLILPPDHHRIHHTKPYNKYYCITVGWMNKPLTMVHFFPTAERLITWATGLLPRQDDI 240
- What can we say about the function of Kua/CarF? (read the second
abstract below)
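The Identities and Gaps figures in the BLAST report above are simple column counts over the aligned Query and Sbjct rows. As a sketch, the code below recomputes them for the first alignment block only (Query 31-90 against Sbjct 15-73); the names `alignment_stats` and `blast_percent` are illustrative. The Positives figure is not reproduced because it needs the substitution matrix used by the search. The percentages printed by BLAST here are consistent with truncating rather than rounding (107/239 is shown as 44%), so the helper truncates as well.

```python
def alignment_stats(query, sbjct):
    """Count aligned columns, identical columns and gap columns."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query, sbjct))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return {"length": len(pairs), "identities": identities, "gaps": gaps}


def blast_percent(part, whole):
    """Integer percentage, truncated as in the report above."""
    return 100 * part // whole


# First block of the Kua / CarF alignment, copied from the report.
QUERY = "ARELAALYSPGKRLQEWCSVILCFSLIAHNLVHLLLLARWEDTPLVILGVVAGALIADFL"
SBJCT = "AQVLAQGYSPAIRAME-IAAIVSFVSLEVALVYRLWGTPYAGTWLLLSAVLLGYLAADFV"

block = alignment_stats(QUERY, SBJCT)
print(block, blast_percent(block["identities"], block["length"]))

# Whole-alignment figures as printed by BLAST.
print(blast_percent(107, 239), blast_percent(134, 239), blast_percent(22, 239))
```

Note that the first block is less conserved than the alignment as a whole; most of the identical positions lie in the later blocks, around the histidine-containing stretches.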
References for the Kua-UEV protein and the Kua
homologue in Myxococcus xanthus (CarF):
Genome Res. 2000 Nov;10(11):1743-56.
Fusion of the human gene for the
polyubiquitination coeffector UEV1 with Kua, a newly identified gene.
Thomson TM, Lozano JJ, Loukili N, Carrio R, Serras F, Cormand B, Valeri M,
Diaz VM, Abril J, Burset M, Merino J, Macaya A, Corominas M, Guigo R.
Institut de Biologia Molecular, Consejo Superior de Investigaciones
Cientificas, Barcelona, Spain. tthomson@hg.vhebron.es
UEV proteins are enzymatically inactive variants of the E2
ubiquitin-conjugating enzymes that regulate noncanonical elongation of
ubiquitin chains. In Saccharomyces cerevisiae, UEV is part of the
RAD6-mediated error-free DNA repair pathway. In mammalian cells, UEV
proteins can modulate c-FOS transcription and the G2-M transition of
the cell cycle. Here we show that the UEV genes from phylogenetically
distant organisms present a remarkable conservation in their
exon-intron structure. We also show that the human UEV1 gene is fused
with the previously unknown gene Kua. In Caenorhabditis elegans and
Drosophila melanogaster, Kua and UEV are in separated loci, and are
expressed as independent transcripts and proteins. In humans, Kua and
UEV1 are adjacent genes, expressed either as separate transcripts
encoding independent Kua and UEV1 proteins, or as a hybrid Kua-UEV
transcript, encoding a two-domain protein. Kua proteins represent a
novel class of conserved proteins with juxtamembrane histidine-rich
motifs. Experiments with epitope-tagged proteins show that UEV1A is a
nuclear protein, whereas both Kua and Kua-UEV localize to cytoplasmic
structures, indicating that the Kua domain determines the cytoplasmic
localization of Kua-UEV. Therefore, the addition of a Kua domain to UEV
in the fused Kua-UEV protein confers new biological properties to this
regulator of variant polyubiquitination.
Mol Microbiol. 2003 Jan;47(2):561-71.
A novel regulatory gene for light-induced carotenoid
synthesis in the bacterium Myxococcus xanthus.
Fontes M, Galbis-Martinez L, Murillo FJ.
Departamento de Genetica y Microbiologia, Facultad de Biologia,
Universidad de Murcia, Spain.
Myxococcus xanthus cells respond to blue light by producing carotenoids. Light
triggers a network of regulatory actions that lead to the
transcriptional activation of the carotenoid genes. By screening the
colour phenotype of a collection of Tn5-lac insertion mutants, we have
isolated a new mutant devoid of carotenoid synthesis. We map the
transposon insertion, which co-segregates with the mutant phenotype, to
a previously unknown gene designated here as carF. An in frame deletion
within carF causes the same phenotype as the Tn5-lac insertion. The
carF deletion prevents the activation of the normally light-inducible
genes, without affecting the expression of any of the regulatory genes
known to be expressed in a light-independent manner. Until now, the
switch that sets off the regulatory cascade had been identified with
light-driven inactivation of protein CarR, an antisigma factor. The
exact mechanism of this inactivation has remained elusive. We show by
epistatic analysis that the carF gene product participates in the
light-dependent inactivation of CarR. The predicted CarF amino acid
sequence reveals no known prokaryotic homologues. On the other hand,
CarF is remarkably similar to Kua, a family of proteins of unknown
function that is widely distributed among eukaryotes.
Page maintained by Mar Albà and Eduardo Eyras,
February 2006