Practical 20: Databases of Protein Domains
Proteins have a modular organisation. They are made up of
different regions with specific functions: protein (or functional) domains.
Each domain is characterized by one or more sequence motifs, which are related to
the function carried out by the domain.
Preservation of that function is what prevents the motif from gradually disappearing by the
accumulation of mutations during evolution. The same domain may be found in many
different proteins from the same organism and in many different organisms. For instance,
the RNA binding domain is one of the most abundant in all eukaryotes.
We can represent a conserved domain as a multiple alignment. From the
multiple alignment we can build a description of the sequence motif using a consensus sequence, a
position-specific scoring matrix (or weight matrix) or a hidden Markov model
(this will be studied in the structural biology course).
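As a minimal sketch of the simplest of these descriptions, the Python snippet below derives a majority-rule consensus from a small ungapped alignment. The four rows are taken from the alignment shown later in this practical; the function name consensus and the alphabetical tie-breaking rule are illustrative assumptions, not part of any of the databases mentioned here.

```python
from collections import Counter

# Four rows of the ungapped alignment shown later in this practical.
alignment = [
    "YIHRDLAARNCLV",
    "FIHRDLAARNCLV",
    "FIHRDLAARNCLV",
    "FVHRDLACRNCLV",
]

def consensus(seqs):
    """Return the most frequent residue of each column (ties: alphabetical)."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*seqs):
        counts = Counter(column)
        # Sort by descending count, then alphabetically, so ties are deterministic.
        best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
        result.append(best)
    return "".join(result)

print(consensus(alignment))
```

A consensus discards the minority residues of each column (here Y, V and C), which is one reason why weight matrices and hidden Markov models are usually preferred for database searches.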
A number of databases exist that store
information on known protein domains
(Prosite, Pfam, Interpro, SMART, ...).
In Interpro (Mulder
et al., 2005) we can access different domain databases.
Using a formal representation of the domain (for example, a
position-specific scoring matrix) we can search sequence databases for
other molecules that contain the same domain. These searches are usually very
sensitive and allow us to detect remote homologies.
CLUSTAL W (1.82) multiple sequence alignment
ABL_CALVI/28-40       YIHRDLAARNCLV 13
ABL_DROME/505-517     YIHRDLAARNCLV 13
ABL_FSVHY/308-320     FIHRDLAARNCLV 13
ABL2_HUMAN/405-417    FIHRDLAARNCLV 13
ABL1_MOUSE/359-371    FIHRDLAARNCLV 13
ABL1_HUMAN/359-371    FIHRDLAARNCLV 13
ABL1_CAEEL/428-440    FIHRDLAARNCLV 13
7LES_DROME/2339-2351  FVHRDLACRNCLV 13
7LES_DROVI/2351-2363  FVHRDLACRNCLV 13
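To make the idea of searching with a weight matrix concrete, the sketch below builds a log-odds position-specific scoring matrix from the nine aligned segments above and slides it along a query sequence. Several choices are assumptions made only for illustration: a uniform background frequency of 1/20, a pseudocount of 1 per residue, a score of 0 for residues outside the 20-letter alphabet, and the query sequence itself, which is invented. Real search tools use more elaborate background models and pseudocount schemes.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The nine aligned segments from the ClustalW output above.
alignment = [
    "YIHRDLAARNCLV", "YIHRDLAARNCLV", "FIHRDLAARNCLV",
    "FIHRDLAARNCLV", "FIHRDLAARNCLV", "FIHRDLAARNCLV",
    "FIHRDLAARNCLV", "FVHRDLACRNCLV", "FVHRDLACRNCLV",
]

def build_pssm(seqs, pseudocount=1.0):
    """Log-odds matrix: one dict {residue: score in bits} per column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*seqs):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def best_hit(pssm, sequence):
    """Slide the matrix along the sequence; return (score, 0-based start)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Assumption: residues outside the 20-letter alphabet score 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        hits.append((score, start))
    return max(hits)

pssm = build_pssm(alignment)
# Invented query: the motif flanked by arbitrary residues.
query = "MKTAGS" + "FIHRDLAARNCLV" + "GENHLV"
score, start = best_hit(pssm, query)
print(f"best window starts at position {start + 1} with score {score:.2f} bits")
```

Unlike a consensus, the matrix still rewards the minority residues seen in the alignment (Y, V and C), so a sequence carrying them scores higher than one carrying an unobserved residue at the same position.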
We will identify protein domains in a sequence of interest using
Interpro, a resource that gives access to several protein domain databases.
cloned the following
Part 1. Identify which gene this sequence corresponds to and get the
complete protein sequence
* Copy the sequence above and paste it into BLAST to identify the
complete gene entry (select the blastn program)
* After the first page click Format and wait for the results.
* How many homologous sequences do we find with the BLAST search? From
* Click on an entry with 100% identity to our sequence (presumably the
gene it belongs to). Which gene is this?
* Go to the protein entry for this gene (click on /protein_id)
* At Display select "FASTA" (at top of entry) and click "Display"
* Keep the sequence for the next section
(copy in a text file or leave this window open)
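If the protein sequence is kept in a text file, a few lines of Python are enough to read it back. The sketch below is a minimal FASTA reader using only the standard library; the record shown is a placeholder invented for illustration, not the protein retrieved in this part.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Placeholder record: the header and residues are invented for illustration.
example = io.StringIO(
    ">example_protein hypothetical record\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
)
records = list(read_fasta(example))
for header, seq in records:
    print(header, len(seq))
```

For a saved file, the same function can be called on an open file object, for example with open("protein.fasta") as handle, where the file name is hypothetical.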
Part 2. Identify known domains on the protein
* Go to Interpro and
(menu at the left)
* Paste the protein sequence in the window and Submit Job
* How many Interpro domains hit the protein?
* Go to Table View.
* How can we know how reliable the hits to these domains are?
* Go to Raw Output
* Find the domain boundaries of motif PF00643
zf-B_box in our protein sequence
* Go back to Picture View
* Click on the Interpro entry defined
as "Zn-finger, B-box" (IPR000315)
* Read the description of the domain
* How many proteins in the Interpro database
contain this domain?
* Go down the entry to see different domain
architectures of the proteins that contain this domain
* Click on the Pfam entry of the hit
corresponding to "Zn-finger,
B-box" (PF00643 zf-B_box)
* Go to Alignment and click on "Get alignment" and
"View HMM logo"
* Which are the best conserved residues?
* Go to Species Distribution
* Which species contains
the most proteins with this domain?
* Go to Domain organisation
* How many domain combinations also
contain the Bromodomain?
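The question about the best conserved residues can also be approached numerically. The sketch below computes the information content of each alignment column in bits, which is the quantity that sequence logos display as letter height. It is a simplified analogue only: the logo on the Pfam page is derived from the profile HMM rather than from raw column counts, no small-sample correction is applied, and gaps are not handled. Since the zf-B_box alignment is not reproduced in this practical, the example reuses the nine-sequence alignment from the introduction.

```python
import math
from collections import Counter

def column_information(seqs, alphabet_size=20):
    """Information content (bits) of each column: log2(alphabet) - entropy."""
    max_bits = math.log2(alphabet_size)
    info = []
    for column in zip(*seqs):
        counts = Counter(column)
        n = len(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(max_bits - entropy)
    return info

# Ungapped alignment from the introduction of this practical.
alignment = [
    "YIHRDLAARNCLV", "YIHRDLAARNCLV", "FIHRDLAARNCLV",
    "FIHRDLAARNCLV", "FIHRDLAARNCLV", "FIHRDLAARNCLV",
    "FIHRDLAARNCLV", "FVHRDLACRNCLV", "FVHRDLACRNCLV",
]

info = column_information(alignment)
for position, (residues, bits) in enumerate(zip(zip(*alignment), info), start=1):
    most_common = Counter(residues).most_common(1)[0][0]
    print(f"{position:2d} {most_common} {bits:.2f}")
```

Columns with a single residue reach the maximum of log2(20) bits, while the three variable columns (positions 1, 2 and 8) score lower.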
Part 3. Analyse the similarity to mouse protein B-raf
* Go back to BLAST results and click on the entry
">gi|553877|gb|M64429.1|MUSBRAF Mouse B-raf oncogene mRNA, complete
* Get the protein entry in FASTA format as before
* Go to ClustalW and
paste the two protein sequences (TIF1 and B-raf) into the window in FASTA format
* Analyse the protein alignment.
* Open a new browser window, go to Interpro and select
* Paste the B-raf protein sequence in the window and Submit Job
* Compare the results with those obtained with TIF1. Which protein
domains are shared and which are not?
* Compare the domain organization of the two proteins to the multiple
sequence alignment obtained with ClustalW.
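When analysing the pairwise alignment from this part, a simple summary statistic is the percent identity. The sketch below counts identical residues over the columns where neither sequence has a gap. Other definitions of percent identity use a different denominator (for instance the full alignment length), so this choice is an assumption, and the two aligned rows are invented toy data, not TIF1 or B-raf.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues as a percentage of gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Toy aligned pair, invented for illustration (not TIF1 or B-raf).
row_a = "MKT-AYIAKQR"
row_b = "MRTGAY-AKHR"
print(f"{percent_identity(row_a, row_b):.1f}% identity")
```

Applying the same function to slices of the alignment that correspond to a shared domain and to the remaining regions is one way to check whether similarity is concentrated where the two proteins have a domain in common.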
Link to protein sequences in fasta format here.