UB_2005

Introduction

1. Characteristics of gene expression regulatory sequences

Transcription regulation is mediated by transcription factors, many of which recognize DNA motifs in a sequence-specific manner. This binding to DNA motifs may activate the expression of a gene or it may repress it. The regulation involves interactions between different transcription factors.

Regions upstream of the transcription start site typically contain several regulatory elements and are known as promoters. In multicellular organisms, however, the picture can be considerably more complex: a gene's regulatory sequence can consist of several cis-regulatory modules, discrete regions that drive the expression of the gene under particular conditions or in particular tissues.

Regulatory sequences have the potential to evolve quickly, as mutations in one or a few nucleotides may lead to the loss of a transcription factor binding site or the acquisition of a new one, and different cis-regulatory modules can evolve independently.

Transcription factors contain DNA-binding domains for the interaction with the DNA sequence. The interaction often involves the major groove of the DNA molecule and is established through hydrogen bonds and van der Waals interactions.

Examples of Regulatory Proteins
Type	Abbreviation	Example
helix-turn-helix	HTH	lac repressor, CRP (CAP)
basic leucine zipper	bZIP	CREB, AP1 (Fos, Jun)
zinc finger	zif	TFIIIA, Gal4

Distribution of eukaryotic transcription factor DNA-binding domains:

Zinc finger domain (C2H2):


                                 x  x
                               x      x
                              x        x
                              x        x
                              x        x
                              x        x
                               C      H
                             x   \  /   x
                            x     Zn     x
                             x  /    \  x
                               C      H
                      x x x x x        x x x x x

2. Transcription factor binding sites

Transcription factors bind to DNA motifs in regulatory regions in a sequence-specific manner. The motifs are short and variable. For this reason computational searches for DNA motifs are often very unspecific: we will find many instances of the motif just by chance.
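The point about chance matches can be made concrete with a rough calculation. The sketch below assumes a random sequence with independent, equiprobable bases, which real promoters are not; the function name and its parameters are illustrative only.

```python
def expected_chance_matches(seq_len, motif_len, n_matching=1, both_strands=True):
    """Expected number of chance matches to a short motif in a random sequence.

    Assumption: bases are independent and equiprobable (p = 0.25 each).
    n_matching is the number of distinct words of length motif_len that
    the motif description accepts (1 for a single fixed word).
    """
    windows = seq_len - motif_len + 1
    if windows <= 0:
        return 0.0
    p_match = n_matching / 4 ** motif_len
    strands = 2 if both_strands else 1
    return strands * windows * p_match


# A single fixed hexamer in a 346 bp upstream region, both strands
print(expected_chance_matches(346, 6))
# The same hexamer in one million bp, both strands
print(expected_chance_matches(1_000_000, 6))
```

Under these assumptions a fixed hexamer is expected hundreds of times per million base pairs, so a match on its own says little about function.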

example:

TATA site in the gene promoter

CLUSTAL W (1.82) multiple sequence alignment

seq3               TATAAA 6
seq7               TATAGA 6
seq8               TATAAA 6
seq2               TATAAA 6
seq5               GATAAA 6
seq6               TATAAA 6
seq1               TATAAA 6
seq4               TATAAT 6
                    ***

We can capture this variability using consensus sequences, such as:
[T/G]ATA[A/G][A/T],

or position weight matrices (PWM), such as:

TATA box PWM:

                1    2    3    4    5    6
          - - - - - - - - - - - - - - - - - - - -
           A    0    8    0    8    7    7
           C    0    0    0    0    0    0
           G    1    0    0    0    1    0
           T    7    0    8    0    0    1

relative frequencies:

                1        2        3        4        5        6
          - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           A    0        1        0        1        0.875    0.875
           C    0        0        0        0        0        0
           G    0.125    0        0        0        0.125    0
           T    0.875    0        1        0        0        0.125

(each cell is Mij, the relative frequency of nucleotide i at position j)

Using a PWM allows us to perform more specific searches: we take into account not only whether a nucleotide can appear at a particular position but also how frequently it is observed there.
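As a minimal sketch, the count and relative-frequency matrices above can be rebuilt from the eight aligned TATA sites. The function names are illustrative, and scoring a site by the product of its Mij values is one simple choice rather than the method of any particular tool.

```python
BASES = "ACGT"

# The eight aligned TATA sites from the alignment above
SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]


def count_matrix(sites):
    """Count how often each base occurs at each alignment position."""
    width = len(sites[0])
    counts = {base: [0] * width for base in BASES}
    for site in sites:
        for j, base in enumerate(site):
            counts[base][j] += 1
    return counts


def frequency_matrix(counts):
    """Convert counts to relative frequencies (the Mij values)."""
    n_sites = sum(counts[base][0] for base in BASES)
    return {base: [c / n_sites for c in counts[base]] for base in BASES}


def site_probability(site, freqs):
    """Product of Mij over the positions of a candidate site."""
    p = 1.0
    for j, base in enumerate(site):
        p *= freqs[base][j]
    return p


counts = count_matrix(SITES)
freqs = frequency_matrix(counts)
for base in BASES:
    print(base, counts[base], freqs[base])

print(site_probability("TATAAA", freqs))  # most frequent base at every position
print(site_probability("GATAAA", freqs))  # rarer base at position 1
print(site_probability("TATACA", freqs))  # C was never observed at position 5
```

A base never observed at a position gives the whole site a probability of zero. With only eight sites that is probably too strict, which is why pseudocounts are commonly added before scoring.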

Motif logos:

TATA_logo

3D structure:

TBP_structure

TATA-binding protein (TBP) bound to the TATA site

Practical: prediction of regulatory motifs in the cardiac alpha-actin gene

We have the following sequences from the upstream region of the cardiac alpha-actin gene:

>human_-360_to_-15
CTGCGGAGGACCGAATCCACAGACCATCCAGGGAGCACCCACACCCCAGAAAGGGGGAGGGGTGGGCTGGCGTCAC
TTAGTCTTCCCCTGCCCCCTACCCTTCAGCGCCTGCCCCTCCCCAGCTCCCTATTTGGCCATCCCCCTGACTGCCCCC
TCCCCTTCCTTACATGGTCTGGGGGCTCCCTGGCTGATCCTCTCCCCTGCCCTTGGCTCCATGAATGGCCTCGGCAGT
CCTAGCGGGTGCGAAGGGGACCAAATAAGGCAAGGTGGCAGACCGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAA
CTGACCCTGTCCATCAGCGTTCTATAAAGCGGCCCTCCT
>mouse_-360_to_-12
TTGGAAGGGCTGAAGAGCAATAAGCCCACTCCACAACTAGGGAGCTCCCCCACCCAAGGGGCGCATTGGCATCACATAGCCTTTCC
CCGTCCCCCACCCCTTGCTGGCCTGCCCCTCCCTAGCTCCCTATATGGCCATTGCTCTGACTGCCCCCTCCCCTTCCTTACATGG
TCTGGGAGCCCCCTGGCTGATCCTCTACCCTGCCCTTGGCTCCAAGAATGGCCTCAGCGGTCCTAGATGGTGCTAAGGCGACCAA
ATAAGGCAAGGTGGCAGATCAGGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAACTGACCCCGTCCATCAGAGAGCTATAAAGCTG
CGCTCCA
>chicken_-335_to_-24
ACGCCCCGCGTGAAGGCCACCCGGGCCCGACATCTCGGGCAGCGCACCTGGCTTACACTTCCTCGAGGGACCATGA
GGGCCACAGAAGAACTCCGAGCCTCCCCTCCCACCACGTCGGCGGAGGCTCCCTATTTGGCCATGTGGCGGCGGXX
XXXXXXXXXXTCCGCACCTGCCTTAGATGGCCGGACAGCCGCGCCGCCTTGCGCCATTCATGGCCGCGCTGCGCCG
CCATGGCGCCGAGCCGGCCAAATAAGAGAAGGTGGCTGCCCCGGCCCGCGGACCGCGGCCGCCGGGGGCTATAAAG
CGGCAGCTTC
>frog_-355_to_-14
AGTCCCCCTGCACAATTGTGCTGCACCTGTCTACTCCATTTGCAGACCCCTGTGTCTGTGCAAACTATTTCTTTCATT
GTGCTGTTTTTTTTGTCACCCAGCATTACAGACATGCTTTTTTGGGAATCCCTATTTGGCCATCCCTAGTAGTGCTCC
CXXXXXXXXXXXXXXTTTCCATACATGGGCTAAGGGGTCCAAAGACCCTGCCCTCCCCCCTCACCTACTCCATTAA
TGGCTTCTTTGCTTTTCAATGGCCAGAAGCTACCAAATAAGGGCAGGCTGCCTGCCTTTCGGAGCTCCCACTGACTC
CTCAACTCCAGGCAGCGTATAAATTGACAGCTCA

The aim of the practical is to explore different ways of predicting transcriptional regulatory elements, given these upstream sequences from orthologous genes.

1. Explore the conservation and perform predictions of the TATA box

1.1 Create a "Logo" conservation representation of the motif below

     TATA box (TATA binding protein site)

     seq1               TATAAA
     seq2               TATAGA
     seq3               TATAAA
     seq4               TATAAA
     seq5               GATAAA
     seq6               TATAAA
     seq7               TATAAA
     seq8               TATAAT

*    Go to WebLogo

*    Paste the alignment above (from seq1 to seq8)

*    Click on "Create Logo"
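The letter-stack heights in a sequence logo reflect the information content of each alignment column. The sketch below computes this in bits as 2 minus the Shannon entropy of the column. It omits the small-sample correction that logo tools may apply, so the values are only approximate for eight sequences; the function name is illustrative.

```python
import math

# The eight aligned TATA sites listed above
SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]


def information_content(sites):
    """Information content (bits) of each column: 2 - Shannon entropy.

    Assumption: no small-sample correction is applied.
    """
    n_sites = len(sites)
    width = len(sites[0])
    bits = []
    for j in range(width):
        column = [site[j] for site in sites]
        entropy = 0.0
        for base in "ACGT":
            freq = column.count(base) / n_sites
            if freq > 0:
                entropy -= freq * math.log2(freq)
        bits.append(2.0 - entropy)
    return bits


for position, value in enumerate(information_content(SITES), start=1):
    print(position, round(value, 3))
```

Fully conserved columns (positions 2 to 4) reach the maximum of 2 bits, while columns with one variant nucleotide score lower and appear as shorter stacks in the logo.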

1.2 Look up TBP properties in the database TRANSFAC

*    Go to Transfac database (login: ub_2006; password: ub2006)

*    In TRANSFAC 7.0: choose Search action

*    Select the table of Factor

*    Enter the factor name TBP

*    In "Table field to search in" select Factor Name (NA)

*    Select T00794 factor (the human TBP) for inspection

*    Try to understand the information available in each field

1.3 Recover the weight matrix that represents the TATA box from the database TRANSFAC

*    Go to Transfac database (login: ub_2006; password: ub2006)

*    In TRANSFAC 7.0: choose Search action

*    Select the table of Matrix

*    Enter the name TATA

*    In "Table field to search in" select (Factor) Name (NA)

*    There are two entries: M00252 and M00216

*    Select M00252 matrix for inspection

*    Keep this matrix for further use (from AC line to //)

1.4 Predict the position of the TATA box in the human alpha-actin cardiac gene upstream sequence

*    Open RSA tools webserver

*    On the left frame, click on Pattern matching - patser (matrices)

*    Paste the human alpha-actin cardiac gene upstream sequence

*    Select Transfac as Matrix Format

*    Paste the Transfac TATA matrix (including matrix header) obtained previously

*    In Origin select "start" (of the sequence) and press GO

*    Check the results. They should look something like this:

map	type	id	strand	start	end	sequence	score	ln(P)
human_-360_to_-15	site	patser	D	46	60	caccCCAGAAAGGGGGAGGggtg	3.29	-5.74
human_-360_to_-15	site	patser	R	120	134	tcccCAGCTCCCTATTTGGccat	1.41	-4.61
human_-360_to_-15	site	patser	D	127	141	ctccCTATTTGGCCATCCCcctg	0.46	-4.13
human_-360_to_-15	site	patser	R	157	171	cctcCCCTTCCTTACATGGtctg	3.78	-6.07
human_-360_to_-15	site	patser	D	213	227	ggctCCATGAATGGCCTCGgcag	4.38	-6.51
human_-360_to_-15	site	patser	D	253	267	gggaCCAAATAAGGCAAGGtggc	3.00	-5.54
human_-360_to_-15	site	patser	R	323	337	ccatCAGCGTTCTATAAAGcggc	4.50	-6.60
human_-360_to_-15	site	patser	D	328	342	agcgTTCTATAAAGCGGCCctcc	0.91	-4.35
human_-360_to_-15	site	patser	D	330	344	cgttCTATAAAGCGGCCCTcc	7.67	-9.65

*   Each row in the results table is a putative match to the TATA box matrix. Which one is most likely to be the real TATA box?

*   To obtain a graphical representation of predictions press "feature map"

*   In the RSA-tools - feature map page press "GO"

*   Identify the best TATA box prediction in the drawing
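The matrix scan performed in this step can be sketched in a few lines. This is not patser's scoring scheme, and it does not use the 15-column TRANSFAC matrix: it builds a 6-column log-odds matrix from the eight TATA sites of the introduction, assuming a pseudocount of 0.5 and a uniform background, and scans only the forward strand of the last line of the human sequence. Scores are therefore not comparable with the table above.

```python
import math

BASES = "ACGT"
SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]

# Last line of the human upstream sequence given above
HUMAN_3PRIME = "CTGACCCTGTCCATCAGCGTTCTATAAAGCGGCCCTCCT"


def log_odds_matrix(sites, pseudocount=0.5, background=0.25):
    """Per-position log2 odds of each base versus a uniform background.

    Assumptions: the pseudocount and background values are illustrative.
    """
    n_sites = len(sites)
    matrix = []
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        matrix.append({
            base: math.log2(
                (column.count(base) + pseudocount)
                / (n_sites + 4 * pseudocount)
                / background
            )
            for base in BASES
        })
    return matrix


def scan(sequence, matrix):
    """Score every forward-strand window; returns (1-based start, site, score)."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing masked positions such as X
        score = sum(matrix[j][base] for j, base in enumerate(window))
        hits.append((start + 1, window, score))
    return hits


hits = scan(HUMAN_3PRIME, log_odds_matrix(SITES))
best = max(hits, key=lambda hit: hit[2])
print(best)
```

As in the patser output, most windows receive some score. Only the ranking and a chosen threshold separate the likely TATA box from background matches.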

2. Run predictions of all putative transcription factor binding sites in the human alpha-actin cardiac gene

*   Go to PROMO (Select RESEARCH and then PROMO 3.0).

*   Go to SearchSites and input the human alpha-actin cardiac gene upstream sequence.

*   Click Submit.

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Go back and go to SelectSpecies in the menu at the left. Choose "only human factors" and "only human sites". Click Submit.

*   Go to SearchSites and input the human alpha-actin cardiac gene upstream sequence.

*   Click Submit.

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Click on some of the factors below "Factors predicted within a dissimilarity margin less or equal than 15%"

*   Check the values of dissimilarity and RE of "TBP" (TATA-binding protein) by clicking on the colored box (number 33).

*   Click on Zoom. How many predictions do we have for "TBP"? Can we trust them?

3. Run predictions of all putative transcription factor binding sites in the four vertebrate alpha-actin cardiac genes

*   Go to PROMO (Select RESEARCH and then PROMO 3.0).

*   Go to SelectSpecies in the menu at the left.

*   Select "Selected species factors" and "Selected species sites".

*   We will select the phylogenetic group "chordata". To do so we click below on the arrow for "all species", then we select "eukaryota", then "animals", then "chordata". We perform this for "Factors of" and "Sites of".

*   Go to MultiSearchSites in the menu at the left.

*   Paste the 4 sequences of the cardiac alpha-actin gene. Select "Sites found in 1 or more sequences" to see all the predictions. Click Submit.

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   As these are orthologous sequences it is likely that functional sites are shared. Go back and select "Sites found in all sequences".

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Go back and change the dissimilarity cut-off (the maximum percent dissimilarity allowed) to 5.

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Click on some of the factors below "Factors predicted in 4 or more input sequences within a dissimilarity margin less or equal than 5 %".

*   Compare the RE values of SRF (box number 32), which are very low, with those of GR-beta (box number 1), which are much higher.

*   Annotate the approximate position of the SRF predictions on the sequences.
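The reasoning behind "Sites found in all sequences" can be illustrated with a toy example: keep only the short words that occur in every orthologous sequence. The sketch below uses exact 6-mers on the forward strand of the final stretch of each of the four sequences. This is a simplification for illustration, not how PROMO compares sites; the excerpt lengths and the word size are arbitrary choices.

```python
# 3' ends of the four upstream sequences given above (excerpts only)
EXCERPTS = {
    "human": "CAGCGTTCTATAAAGCGGCCCTCCT",
    "mouse": "AGAGAGCTATAAAGCTGCGCTCCA",
    "chicken": "GGGGCTATAAAGCGGCAGCTTC",
    "frog": "CAGCGTATAAATTGACAGCTCA",
}


def kmers(sequence, k):
    """Set of all words of length k found in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(sequences, k):
    """Words of length k present in every one of the sequences."""
    word_sets = [kmers(sequence, k) for sequence in sequences]
    return set.intersection(*word_sets)


shared = shared_kmers(EXCERPTS.values(), 6)
print(sorted(shared))
```

Requiring presence in all four species removes most words that occur in a single sequence by chance, which is the same effect seen when switching from "Sites found in 1 or more sequences" to "Sites found in all sequences".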

4. Check the experimentally-validated sites on these sequences

*   Search for known TFBS in these sequences in the Catalog of Muscle-specific Regulatory Elements

*   Go to Table of Contents.

*   Go to Actin, Alpha-Cardiac.

*   Go to "Transcription factor binding sites".

*   Go down the page to see binding sites in the four alpha-actin cardiac gene upstream sequences.

*   Note that SRF is conserved in the 4 sequences. Compare with the results obtained in the previous sections with TFBS prediction programs.

Additional files:

TATA matrix from TRANSFAC

PROMO chordata tree