1. Characteristics of gene expression regulatory sequences
Transcription regulation is mediated by transcription factors, many of
which recognize DNA motifs in a sequence-specific manner. Binding to
these motifs may activate or repress the expression of a gene.
Regulation often involves interactions between different transcription
factors.
Regions upstream of the transcription start site typically contain
several regulatory elements and are known as promoters. In
multicellular organisms the picture can be considerably more complex: a
gene's regulatory sequence can consist of several cis-regulatory modules,
discrete regions that drive the expression of the gene under particular
conditions or in particular tissues.
Regulatory sequences have the potential to evolve quickly, as mutations
in one or a few nucleotides may lead to the loss of an existing
transcription factor binding site or the acquisition of a new one, and
different cis-regulatory modules can evolve independently.
Transcription factors contain DNA-binding domains that interact with
the DNA sequence. The interaction often involves the major groove of
the DNA molecule and is established through hydrogen bonds and van der
Waals interactions.
Examples of Regulatory Proteins
Type                 | Abbreviation | Example
helix-turn-helix     | HTH          | lac repressor, CRP (CAP)
basic leucine zipper | bZIP         | CREB, AP1 (Fos, Jun)
zinc finger          | zif          | TFIIIA, Gal4
[Figure: distribution of eukaryotic transcription factor DNA-binding domains]
Zinc finger domain (C2H2):
          x   x
        x       x
       x         x
       x         x
        x       x
         x     x
          C   H
        x  \ /  x
        x   Zn  x
        x  / \  x
          C   H
 x x x x x     x x x x x
2. Transcription factor binding sites
Transcription factors bind to DNA motifs in regulatory
regions in a sequence-specific manner. The motifs are short and
variable. For this reason computational searches for DNA motifs are
often very nonspecific: we will find many instances of the motif just
by chance.
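To get a feel for how often a short, degenerate motif turns up by chance, the following sketch estimates the expected number of matches in random DNA. It assumes independent positions and equal base frequencies (0.25 each), which real promoters do not satisfy. The six-base pattern and the 350-bp length are illustrative choices, not values taken from any tool.

```python
# Expected number of chance matches of a degenerate motif in random DNA.
# Assumptions (not from the original text): bases are independent and
# equally frequent, and both strands are searched.

def match_probability(allowed_per_position):
    """Probability that a random site matches, given the number of
    allowed bases (1 to 4) at each motif position."""
    p = 1.0
    for n_allowed in allowed_per_position:
        p *= n_allowed / 4.0
    return p

def expected_matches(allowed_per_position, seq_length, both_strands=True):
    """Expected count of matching windows in a random sequence."""
    width = len(allowed_per_position)
    windows = max(seq_length - width + 1, 0)
    strands = 2 if both_strands else 1
    return strands * windows * match_probability(allowed_per_position)

# A six-base pattern with two allowed bases at positions 1, 5 and 6
# and a single allowed base at positions 2 to 4.
pattern = [2, 1, 1, 1, 2, 2]
print(match_probability(pattern))      # 1/512 per window
print(expected_matches(pattern, 350))  # expected hits on both strands
```

Under these simplifying assumptions, a 350-bp region is expected to contain more than one match to such a pattern by chance alone.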
Example: TATA site in the gene promoter
CLUSTAL W (1.82) multiple sequence alignment
seq3    TATAAA 6
seq7    TATAGA 6
seq8    TATAAA 6
seq2    TATAAA 6
seq5    GATAAA 6
seq6    TATAAA 6
seq1    TATAAA 6
seq4    TATAAT 6
         ***
We can capture this variability using consensus sequences, such as
[T/G]ATA[A/G][A/T], or position weight matrices (PWM), such as:
TATA box PWM:
     1   2   3   4   5   6
    -----------------------
A    0   8   0   8   7   7
C    0   0   0   0   0   0
G    1   0   0   0   1   0
T    7   0   8   0   0   1
relative frequencies:
     1       2       3       4       5       6
    -------------------------------------------
A    0       1       0       1       0.875   0.875
C    0       0       0       0       0       0
G    0.125   0       0       0       0.125   0
T    0.875   0       1       0       0       0.125
(each cell is Mij)
Using a PWM allows us to perform more specific searches: we are
interested not only in whether a nucleotide can appear at a particular
position but also in how often it is observed there.
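As a minimal sketch of this idea, the code below rebuilds the count and relative-frequency matrices from the eight aligned TATA sites and scores candidate hexamers. The log-odds score, the pseudocount of 0.5 and the uniform background of 0.25 are assumptions made for illustration; they are not the scoring scheme of any particular program.

```python
# Build count and relative-frequency matrices from aligned sites, then
# score candidate hexamers. Log-odds scoring with a pseudocount and a
# uniform background is an assumption, not the method of a specific tool.
import math

SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]
BASES = "ACGT"

def count_matrix(sites):
    """Count each base at each aligned position."""
    width = len(sites[0])
    counts = {b: [0] * width for b in BASES}
    for site in sites:
        for j, base in enumerate(site):
            counts[base][j] += 1
    return counts

def frequency_matrix(counts):
    """Divide counts by the number of sites to get relative frequencies."""
    n = sum(counts[b][0] for b in BASES)
    return {b: [c / n for c in counts[b]] for b in BASES}

def log_odds_score(site, counts, pseudocount=0.5, background=0.25):
    """Sum over positions of log2(smoothed frequency / background)."""
    n = sum(counts[b][0] for b in BASES)
    score = 0.0
    for j, base in enumerate(site):
        p = (counts[base][j] + pseudocount) / (n + 4 * pseudocount)
        score += math.log2(p / background)
    return score

counts = count_matrix(SITES)
freqs = frequency_matrix(counts)
print(counts["T"])  # [7, 0, 8, 0, 0, 1]
print(freqs["A"])   # [0.0, 1.0, 0.0, 1.0, 0.875, 0.875]
for candidate in ("TATAAA", "GATAGT", "CCCGGG"):
    print(candidate, round(log_odds_score(candidate, counts), 2))
```

A consensus search would accept both TATAAA and GATAGT equally, whereas the matrix score ranks the first well above the second because G at position 1, G at position 5 and T at position 6 are each seen only once in the alignment.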
Motif logos:

3D structure:

TATA-binding protein (TBP) bound to the TATA site
Practical: prediction of regulatory motifs in the cardiac alpha-actin gene
We have the following sequences of the upstream region of the cardiac
alpha-actin gene:
>human_-360_to_-15
CTGCGGAGGACCGAATCCACAGACCATCCAGGGAGCACCCACACCCCAGAAAGGGGGAGGGGTGGGCTGGCGTCAC
TTAGTCTTCCCCTGCCCCCTACCCTTCAGCGCCTGCCCCTCCCCAGCTCCCTATTTGGCCATCCCCCTGACTGCCCCC
TCCCCTTCCTTACATGGTCTGGGGGCTCCCTGGCTGATCCTCTCCCCTGCCCTTGGCTCCATGAATGGCCTCGGCAGT
CCTAGCGGGTGCGAAGGGGACCAAATAAGGCAAGGTGGCAGACCGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAA
CTGACCCTGTCCATCAGCGTTCTATAAAGCGGCCCTCCT
>mouse_-360_to_-12
TTGGAAGGGCTGAAGAGCAATAAGCCCACTCCACAACTAGGGAGCTCCCCCACCCAAGGGGCGCATTGGCATCACATAGCCTTTCC
CCGTCCCCCACCCCTTGCTGGCCTGCCCCTCCCTAGCTCCCTATATGGCCATTGCTCTGACTGCCCCCTCCCCTTCCTTACATGG
TCTGGGAGCCCCCTGGCTGATCCTCTACCCTGCCCTTGGCTCCAAGAATGGCCTCAGCGGTCCTAGATGGTGCTAAGGCGACCAA
ATAAGGCAAGGTGGCAGATCAGGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAACTGACCCCGTCCATCAGAGAGCTATAAAGCTG
CGCTCCA
>chicken_-335_to_-24
ACGCCCCGCGTGAAGGCCACCCGGGCCCGACATCTCGGGCAGCGCACCTGGCTTACACTTCCTCGAGGGACCATGA
GGGCCACAGAAGAACTCCGAGCCTCCCCTCCCACCACGTCGGCGGAGGCTCCCTATTTGGCCATGTGGCGGCGGXX
XXXXXXXXXXTCCGCACCTGCCTTAGATGGCCGGACAGCCGCGCCGCCTTGCGCCATTCATGGCCGCGCTGCGCCG
CCATGGCGCCGAGCCGGCCAAATAAGAGAAGGTGGCTGCCCCGGCCCGCGGACCGCGGCCGCCGGGGGCTATAAAG
CGGCAGCTTC
>frog_-355_to_-14
AGTCCCCCTGCACAATTGTGCTGCACCTGTCTACTCCATTTGCAGACCCCTGTGTCTGTGCAAACTATTTCTTTCATT
GTGCTGTTTTTTTTGTCACCCAGCATTACAGACATGCTTTTTTGGGAATCCCTATTTGGCCATCCCTAGTAGTGCTCC
CXXXXXXXXXXXXXXTTTCCATACATGGGCTAAGGGGTCCAAAGACCCTGCCCTCCCCCCTCACCTACTCCATTAA
TGGCTTCTTTGCTTTTCAATGGCCAGAAGCTACCAAATAAGGGCAGGCTGCCTGCCTTTCGGAGCTCCCACTGACTC
CTCAACTCCAGGCAGCGTATAAATTGACAGCTCA
The aim of the practical is to explore, given these upstream sequences
from orthologous genes, different ways to predict transcription
regulatory elements.
1. Explore the conservation and perform predictions of the TATA box
1.1 Create a "Logo" conservation representation of the motif below
TATA box (TATA binding protein site)
seq1 TATAAA
seq2 TATAGA
seq3 TATAAA
seq4 TATAAA
seq5 GATAAA
seq6 TATAAA
seq7 TATAAA
seq8 TATAAT
* Go to WebLogo
* Paste the alignment above (from seq1 to seq8)
* Click on "Create Logo"
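The letter-stack heights in a sequence logo reflect the information content of each alignment column. The sketch below computes that quantity for the eight TATA sites as 2 bits minus the Shannon entropy of the column. This is the uncorrected textbook formula; WebLogo may additionally apply a small-sample correction, so the heights it draws can differ slightly.

```python
# Per-column information content (in bits) of a DNA alignment, the
# quantity a sequence logo displays. No small-sample correction is
# applied here; that simplification is an assumption of this sketch.
import math

SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]

def column_information(sites):
    """Return a list with 2 - H(column) for each aligned position."""
    n = len(sites)
    info = []
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / n
            if p > 0:
                entropy -= p * math.log2(p)
        info.append(2.0 - entropy)
    return info

info = column_information(SITES)
for position, bits in enumerate(info, start=1):
    print(position, round(bits, 3))
```

Positions 2 to 4 are fully conserved and reach the maximum of 2 bits, while positions 1, 5 and 6, each with one deviating sequence out of eight, carry less information.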
1.2 Look up TBP properties in the database TRANSFAC
* Go to the Transfac database (login: ub_2006; password: ub2006)
* In TRANSFAC 7.0: choose Search action
* Select the table of Factor
* Enter the factor name TBP
* In "Table field to search in" select Factor Name (NA)
* Select T00794 factor (the human TBP) for inspection
* Try to understand the information available in each field
1.3 Recover the weight matrix that represents the TATA box from the
database TRANSFAC
* Go to the Transfac database (login: ub_2006; password: ub2006)
* In TRANSFAC 7.0: choose Search action
* Select the table of Matrix
* Enter the name TATA
* In "Table field to search in" select (Factor) Name (NA)
* There are two entries: M00252 and M00216
* Select M00252 matrix for inspection
* Keep this matrix for further use (from AC line to //)
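If you want to reuse a saved matrix outside the web tools, a small parser is enough. The sketch below reads a record laid out in the usual TRANSFAC flat-file style (an AC line, a P0 header naming the bases, one numbered row per position, and // as terminator). The record shown is a toy example: its accession and ID are hypothetical, and its counts come from the eight-site alignment in this handout, not from M00252.

```python
# Minimal parser for a TRANSFAC-style matrix record (from AC line to //).
# TOY_RECORD is illustrative only: the accession and ID are made up and
# the counts are those of the eight aligned TATA sites in this handout.

TOY_RECORD = """\
AC  TOY0001
XX
ID  TOY$TATA_EXAMPLE
XX
P0      A      C      G      T
01      0      0      1      7      T
02      8      0      0      0      A
03      0      0      0      8      T
04      8      0      0      0      A
05      7      0      1      0      A
06      7      0      0      1      A
XX
//
"""

def parse_transfac_matrix(text):
    """Return (accession, rows); rows holds one {base: count} dict
    per motif position, in order."""
    accession = None
    bases = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag == "AC":
            accession = fields[1]
        elif tag in ("P0", "PO"):
            # Header row naming the four count columns.
            bases = fields[1:5]
        elif tag == "//":
            break
        elif bases is not None and tag.isdigit():
            values = [float(v) for v in fields[1:5]]
            rows.append(dict(zip(bases, values)))
    return accession, rows

accession, rows = parse_transfac_matrix(TOY_RECORD)
print(accession, len(rows))
print(rows[0])
```

The trailing consensus letter on each row is ignored, since it can be derived from the counts.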
1.4 Predict the position of the TATA box in the human alpha-actin cardiac
gene upstream sequence
* Open RSA tools webserver
* On the left frame, click on Pattern matching - patser (matrices)
* Paste the human alpha-actin cardiac gene upstream sequence
* Select Transfac as Matrix Format
* Paste the Transfac TATA matrix (including matrix header) obtained previously
* In Origin select "start" (of the sequence) and press GO
* Check the results. They should look something like this:
map | type | id | strand | start | end | sequence | score | ln(P)
human_-360_to_-15 | site | patser | D | 46 | 60 | caccCCAGAAAGGGGGAGGggtg | 3.29 | -5.74
human_-360_to_-15 | site | patser | R | 120 | 134 | tcccCAGCTCCCTATTTGGccat | 1.41 | -4.61
human_-360_to_-15 | site | patser | D | 127 | 141 | ctccCTATTTGGCCATCCCcctg | 0.46 | -4.13
human_-360_to_-15 | site | patser | R | 157 | 171 | cctcCCCTTCCTTACATGGtctg | 3.78 | -6.07
human_-360_to_-15 | site | patser | D | 213 | 227 | ggctCCATGAATGGCCTCGgcag | 4.38 | -6.51
human_-360_to_-15 | site | patser | D | 253 | 267 | gggaCCAAATAAGGCAAGGtggc | 3.00 | -5.54
human_-360_to_-15 | site | patser | R | 323 | 337 | ccatCAGCGTTCTATAAAGcggc | 4.50 | -6.60
human_-360_to_-15 | site | patser | D | 328 | 342 | agcgTTCTATAAAGCGGCCctcc | 0.91 | -4.35
human_-360_to_-15 | site | patser | D | 330 | 344 | cgttCTATAAAGCGGCCCTcc | 7.67 | -9.65
* Each row in the results table is a putative match to the
TATA box matrix. Which one is most likely to be the real TATA box?
* To obtain a graphical representation of predictions press "feature map"
* In the RSA-tools - feature map page press "GO"
* Identify the best TATA box prediction in the drawing
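The matrix search in this section can be approximated offline with a few lines of code. The sketch below slides the six-column count matrix from the TATA alignment over the last 39 bases of the human upstream sequence, on both strands, and ranks the windows by a log-odds score. It is not patser: the scoring (pseudocount 0.5, uniform background), the threshold and the 0-based start coordinates are assumptions of this sketch, and the TRANSFAC matrix used in the exercise is wider than six columns.

```python
# Scan a sequence on both strands with a small count matrix.
# Scoring choices here are illustrative assumptions, not patser's method.
import math

COUNTS = {
    "A": [0, 8, 0, 8, 7, 7],
    "C": [0, 0, 0, 0, 0, 0],
    "G": [1, 0, 0, 0, 1, 0],
    "T": [7, 0, 8, 0, 0, 1],
}
N_SITES = 8
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Last line of the human_-360_to_-15 sequence given above.
FRAGMENT = "CTGACCCTGTCCATCAGCGTTCTATAAAGCGGCCCTCCT"

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def log_odds(site, pseudocount=0.5, background=0.25):
    score = 0.0
    for j, base in enumerate(site):
        p = (COUNTS[base][j] + pseudocount) / (N_SITES + 4 * pseudocount)
        score += math.log2(p / background)
    return score

def scan(seq, threshold=0.0):
    """Return (start, strand, site, score) tuples, best first.
    start is 0-based on the direct strand; D = direct, R = reverse."""
    width = len(COUNTS["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) - set("ACGT"):
            continue  # skip windows with masked or ambiguous bases
        for strand, site in (("D", window), ("R", reverse_complement(window))):
            score = log_odds(site)
            if score > threshold:
                hits.append((start, strand, site, score))
    return sorted(hits, key=lambda hit: hit[3], reverse=True)

hits = scan(FRAGMENT)
for start, strand, site, score in hits[:3]:
    print(start, strand, site, round(score, 2))
```

The top-ranked window is the TATAAA on the direct strand near the 3' end of the fragment, in line with the highest-scoring row of the patser table.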
2. Run predictions of all putative transcription factor binding sites in
the human alpha-actin cardiac gene
* Go to PROMO (Select RESEARCH and then PROMO 3.0).
* Go to SearchSites and input the human alpha-actin cardiac gene
upstream sequence.
* Click Submit.
* Inspect the putative TFBSs. How many different TFBS do we find?
* Go back and go to SelectSpecies in the menu on the left. Choose "only
human factors" and "only human sites". Click Submit.
* Go to SearchSites and input the human alpha-actin cardiac gene
upstream sequence.
* Click Submit.
* Inspect the putative TFBSs. How many different TFBS do we find?
* Click on some of the factors below "Factors predicted within
a dissimilarity margin less or equal than 15%"
* Check the values of dissimilarity and RE (random expectation) of "TBP"
(TATA-binding protein) by clicking on the colored box (number 33).
* Click on Zoom. How many predictions do we have for
"TBP"? Can we trust them?
3. Run predictions of all putative transcription factor binding sites in
the four vertebrate alpha-actin cardiac genes
* Go to PROMO (Select RESEARCH and then PROMO 3.0).
* Go to SelectSpecies in the menu at the left.
* Select "Selected species factors" and
"Selected species sites".
* We will select the phylogenetic group "chordata". To do so we
click below on the arrow for "all species", then we select "eukaryota",
then "animals", then "chordata". We perform this for "Factors of" and
"Sites of".
* Go to MultiSearchSites in the menu at the left.
* Paste the 4 sequences of the cardiac alpha-actin gene.
Select "Sites found in 1 or more sequences" to see all the predictions.
Click Submit.
* Inspect the putative TFBSs. How many different TFBS do we find?
* As these are orthologous sequences it is likely that functional
sites are shared. Go back and select "Sites found in all sequences".
* Inspect the putative TFBSs. How many different TFBS do we find?
* Go back and change the dissimilarity cut-off (the maximum percent
dissimilarity allowed) to 5.
* Inspect the putative TFBSs. How many different TFBS do we find?
* Click on some of the factors below "Factors predicted in 4 or
more input sequences within a dissimilarity margin less or equal than 5 %".
* Compare the RE values of SRF (box number 32), which are very low,
with those of GR-beta (box number 1), which are much higher.
* Annotate the approximate position of the SRF predictions
on the sequences.
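The two filters used in this section, a dissimilarity cut-off and a requirement that a factor be predicted in every orthologous sequence, amount to a simple set operation. The sketch below illustrates it on made-up data: the factor lists and dissimilarity values are hypothetical and are not PROMO output.

```python
# Keep only factors predicted in enough sequences at or below a
# dissimilarity cut-off. All predictions below are hypothetical values
# chosen to illustrate the filtering, not results from PROMO.

PREDICTIONS = {
    "human":   [("SRF", 2.0), ("TBP", 4.0), ("GR-beta", 3.0), ("CREB", 12.0)],
    "mouse":   [("SRF", 1.5), ("TBP", 6.0), ("GR-beta", 4.5)],
    "chicken": [("SRF", 3.0), ("TBP", 9.0), ("GR-beta", 2.0)],
    "frog":    [("SRF", 2.5), ("TBP", 3.0), ("GR-beta", 5.0), ("CREB", 14.0)],
}

def shared_factors(predictions, max_dissimilarity, min_sequences=None):
    """Return the sorted factor names found in at least min_sequences
    sequences (default: all of them) within the dissimilarity cut-off."""
    if min_sequences is None:
        min_sequences = len(predictions)
    found_in = {}
    for seq_name, hits in predictions.items():
        for factor, dissimilarity in hits:
            if dissimilarity <= max_dissimilarity:
                found_in.setdefault(factor, set()).add(seq_name)
    return sorted(f for f, seqs in found_in.items()
                  if len(seqs) >= min_sequences)

print(shared_factors(PREDICTIONS, 15, min_sequences=1))
print(shared_factors(PREDICTIONS, 15))
print(shared_factors(PREDICTIONS, 5))
```

Each step shortens the list, which mirrors what happens in the exercise when moving from "1 or more sequences" to "all sequences" and then lowering the cut-off to 5.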
4. Check the experimentally-validated sites on these sequences
* Search for known TFBS in these sequences in the Catalog of
Muscle-specific Regulatory Elements
* Go to Table of Contents.
* Go to Actin, Alpha-Cardiac.
* Go to "Transcription factor binding sites".
* Go down the page to see binding sites in the four alpha-actin
cardiac gene upstream sequences.
* Note that the SRF site is conserved in all 4 sequences. Compare this
with the results obtained in the previous sections with TFBS prediction
programs.
Additional files:
TATA matrix from TRANSFAC
PROMO chordata tree