Practical 21: Pattern searches in regulatory sequences
Bioinformatics 2006/2007




Introduction

1. Characteristics of gene expression regulatory sequences

Transcription regulation is mediated by transcription factors, many of which recognize DNA motifs in a sequence-specific manner. Binding to these motifs may activate or repress the expression of a gene, and regulation often involves interactions between different transcription factors.

Regions upstream of the transcription start site typically contain several regulatory elements and are known as promoters. In multicellular organisms, however, the picture can be quite complex: a gene's regulatory sequence can consist of several cis-regulatory modules, discrete regions that drive the expression of the gene under particular conditions or in particular tissues.

Regulatory sequences have the potential to evolve quickly, as mutations in one or a few nucleotides may lead to the loss of an existing transcription factor binding site or the acquisition of a new one, and different cis-regulatory modules can evolve independently.



[Figure: transcription]


Transcription factors contain DNA-binding domains that interact with the DNA sequence. The interaction often involves the major groove of the DNA molecule and is established through hydrogen bonds and van der Waals interactions.

Examples of Regulatory Proteins

 Type                   Abbreviation   Example
 helix-turn-helix       HTH            lac repressor, CRP (CAP)
 basic leucine zipper   bZIP           CREB, AP1 (Fos, Jun)
 zinc finger            zif            TFIIIA, Gal4


Distribution of eukaryotic transcription factor DNA-binding domains:

[Figure: euk_TF_domains]
 
Zinc finger domain (C2H2):

                  x   x
                x       x
                x       x
                x       x
                x       x
                x       x
                 C     H
               x  \   /  x
               x   Zn    x
               x  /   \  x
                 C     H
       x x x x x         x x x x x

2. Transcription factor binding sites

Transcription factors bind to DNA motifs in regulatory regions in a sequence-specific manner. The motifs are short and variable. For this reason, computational searches for DNA motifs are often very nonspecific: we will find many instances of the motif just by chance.
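To get a feeling for how often a short motif turns up by chance, the following minimal sketch estimates the expected number of matches in random DNA. It assumes that the four bases are equally frequent and independent, which is a simplification not made in this practical; the function name and its parameters are hypothetical.

```python
def expected_chance_matches(seq_len, allowed_per_position, both_strands=True):
    """Expected number of chance matches to a motif in random DNA.

    allowed_per_position lists, for each motif position, how many of the
    four bases are accepted (1 for a fixed base, 2 for e.g. [T/G]).
    Assumption: bases are equiprobable (0.25 each) and independent.
    """
    match_probability = 1.0
    for n_allowed in allowed_per_position:
        match_probability *= n_allowed / 4.0
    # Number of windows of motif length that fit in the sequence
    windows = max(seq_len - len(allowed_per_position) + 1, 0)
    strands = 2 if both_strands else 1
    return windows * match_probability * strands


# A fixed hexamer in a 346 bp sequence, both strands
exact = expected_chance_matches(346, [1, 1, 1, 1, 1, 1])

# A degenerate hexamer with two allowed bases at three positions
degenerate = expected_chance_matches(346, [2, 1, 1, 1, 2, 2])

print(round(exact, 2), round(degenerate, 2))
```

Under these assumptions a fixed hexamer is expected about 0.17 times in a 346 bp sequence, while a degenerate hexamer with three two-base positions is expected about 1.3 times, so a single match carries little evidence on its own.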

example:
TATA site in the gene promoter

CLUSTAL W (1.82) multiple sequence alignment

seq3            TATAAA 6
seq7            TATAGA 6
seq8            TATAAA 6
seq2            TATAAA 6
seq5            GATAAA 6
seq6            TATAAA 6
seq1            TATAAA 6
seq4            TATAAT 6
                 ***

We can capture this variability using consensus sequences, such as:
[T/G]ATA[A/G][A/T],

or position weight matrices (PWM), such as:


TATA box PWM:

                1    2    3    4    5    6
          - - - - - - - - - - - - - - - - - - -
           A    0    8    0    8    7    7
           C    0    0    0    0    0    0
           G    1    0    0    0    1    0
           T    7    0    8    0    0    1
           
Relative frequencies:

                1        2        3        4        5        6
          - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           A    0        1        0        1        0.875    0.875
           C    0        0        0        0        0        0
           G    0.125    0        0        0        0.125    0
           T    0.875    0        1        0        0        0.125

(each cell is Mij, the relative frequency of nucleotide i at position j)

Using a PWM allows us to perform more specific searches than a consensus sequence does. We are interested not only in whether a nucleotide can appear at a particular position but also in how frequently it is observed there.
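The two tables above can be reproduced with a few lines of code. The sketch below builds the count and relative frequency matrices from the eight aligned TATA sites and then scores a candidate site. The scoring step is an assumption: it uses a log-odds score against a uniform background with a pseudocount of 0.5, which is one common convention and not necessarily the one used by the tools in this practical. All function names are hypothetical.

```python
import math

# The eight aligned TATA sites from the example above
SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]
BASES = "ACGT"


def count_matrix(sites):
    """Count how often each base occurs at each alignment position."""
    length = len(sites[0])
    counts = {base: [0] * length for base in BASES}
    for site in sites:
        for j, base in enumerate(site):
            counts[base][j] += 1
    return counts


def frequency_matrix(counts, pseudocount=0.0):
    """Convert counts to relative frequencies Mij, column by column."""
    length = len(counts["A"])
    freqs = {base: [0.0] * length for base in BASES}
    for j in range(length):
        total = sum(counts[base][j] for base in BASES) + 4 * pseudocount
        for base in BASES:
            freqs[base][j] = (counts[base][j] + pseudocount) / total
    return freqs


def log_odds_score(site, freqs, background=0.25):
    """Sum of log2(Mij / background) over the positions of the site.

    Assumption: uniform background. The frequencies must be non-zero,
    so build them with a pseudocount > 0 before scoring.
    """
    return sum(math.log2(freqs[base][j] / background)
               for j, base in enumerate(site))


counts = count_matrix(SITES)
freqs = frequency_matrix(counts)              # matches the table above
smoothed = frequency_matrix(counts, pseudocount=0.5)
for candidate in ("TATAAA", "GATAAA", "CATAAA"):
    print(candidate, round(log_odds_score(candidate, smoothed), 2))
```

A consensus search would accept TATAAA and GATAAA equally and reject CATAAA outright, whereas the matrix score ranks the three candidates in that order.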




Motif logos:


[Figure: TATA_logo]



3D structure:

[Figure: TBP_structure]
TATA-binding protein (TBP) bound to the TATA site





Practical: prediction of regulatory motifs in the cardiac alpha-actin gene


We have the following sequences of the upstream region of the cardiac alpha-actin gene:

>human_-360_to_-15
CTGCGGAGGACCGAATCCACAGACCATCCAGGGAGCACCCACACCCCAGAAAGGGGGAGGGGTGGGCTGGCGTCAC
TTAGTCTTCCCCTGCCCCCTACCCTTCAGCGCCTGCCCCTCCCCAGCTCCCTATTTGGCCATCCCCCTGACTGCCCCC
TCCCCTTCCTTACATGGTCTGGGGGCTCCCTGGCTGATCCTCTCCCCTGCCCTTGGCTCCATGAATGGCCTCGGCAGT
CCTAGCGGGTGCGAAGGGGACCAAATAAGGCAAGGTGGCAGACCGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAA
CTGACCCTGTCCATCAGCGTTCTATAAAGCGGCCCTCCT
>mouse_-360_to_-12
TTGGAAGGGCTGAAGAGCAATAAGCCCACTCCACAACTAGGGAGCTCCCCCACCCAAGGGGCGCATTGGCATCACATAGCCTTTCC
CCGTCCCCCACCCCTTGCTGGCCTGCCCCTCCCTAGCTCCCTATATGGCCATTGCTCTGACTGCCCCCTCCCCTTCCTTACATGG
TCTGGGAGCCCCCTGGCTGATCCTCTACCCTGCCCTTGGCTCCAAGAATGGCCTCAGCGGTCCTAGATGGTGCTAAGGCGACCAA
ATAAGGCAAGGTGGCAGATCAGGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAACTGACCCCGTCCATCAGAGAGCTATAAAGCTG
CGCTCCA
>chicken_-335_to_-24
ACGCCCCGCGTGAAGGCCACCCGGGCCCGACATCTCGGGCAGCGCACCTGGCTTACACTTCCTCGAGGGACCATGA
GGGCCACAGAAGAACTCCGAGCCTCCCCTCCCACCACGTCGGCGGAGGCTCCCTATTTGGCCATGTGGCGGCGGXX
XXXXXXXXXXTCCGCACCTGCCTTAGATGGCCGGACAGCCGCGCCGCCTTGCGCCATTCATGGCCGCGCTGCGCCG
CCATGGCGCCGAGCCGGCCAAATAAGAGAAGGTGGCTGCCCCGGCCCGCGGACCGCGGCCGCCGGGGGCTATAAAG
CGGCAGCTTC
>frog_-355_to_-14
AGTCCCCCTGCACAATTGTGCTGCACCTGTCTACTCCATTTGCAGACCCCTGTGTCTGTGCAAACTATTTCTTTCATT
GTGCTGTTTTTTTTGTCACCCAGCATTACAGACATGCTTTTTTGGGAATCCCTATTTGGCCATCCCTAGTAGTGCTCC
CXXXXXXXXXXXXXXTTTCCATACATGGGCTAAGGGGTCCAAAGACCCTGCCCTCCCCCCTCACCTACTCCATTAA
TGGCTTCTTTGCTTTTCAATGGCCAGAAGCTACCAAATAAGGGCAGGCTGCCTGCCTTTCGGAGCTCCCACTGACTC
CTCAACTCCAGGCAGCGTATAAATTGACAGCTCA

The aim of the practical is to explore different ways of predicting transcription regulatory elements, given these upstream sequences from orthologous genes.


    1. Explore the conservation and perform predictions of the TATA box

    1.1 Create a "Logo" conservation representation of the motif below


     TATA box (TATA binding protein site)

     seq1               TATAAA
     seq2               TATAGA
     seq3               TATAAA
     seq4               TATAAA
     seq5               GATAAA
     seq6               TATAAA
     seq7               TATAAA
     seq8               TATAAT


*    Go to WebLogo

*    Paste the alignment above (from seq1 to seq8)

*    Click on "Create Logo"
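The height of each stack in a sequence logo reflects the information content of that alignment column, measured in bits. The sketch below computes it as 2 minus the Shannon entropy of the column, which is the standard definition for DNA. It omits the small-sample correction that logo programs may apply, so the values may differ slightly from what WebLogo draws. The function name is hypothetical.

```python
import math

# The eight aligned TATA sites from step 1.1 (seq1 to seq8)
SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]


def information_content(sites):
    """Return the information content (bits) of each alignment column.

    For DNA the maximum is log2(4) = 2 bits; the column entropy is
    subtracted from it. No small-sample correction is applied.
    """
    n_sites = len(sites)
    bits = []
    for column in zip(*sites):          # iterate over alignment columns
        entropy = 0.0
        for base in "ACGT":
            freq = column.count(base) / n_sites
            if freq > 0:                # 0 * log2(0) is taken as 0
                entropy -= freq * math.log2(freq)
        bits.append(2.0 - entropy)
    return bits


for position, ic in enumerate(information_content(SITES), start=1):
    print(position, round(ic, 2))
```

The invariant positions 2 to 4 reach the full 2 bits, while the three variable positions come out at about 1.46 bits each.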


    1.2 Look up TBP properties in the database TRANSFAC

*    Go to the TRANSFAC database  (login: ub_2006; password: ub2006)

*    In TRANSFAC 7.0: choose Search action

*    Select the table of Factor

*    Enter the factor name TBP

*    In "Table field to search in" select Factor Name (NA)

*    Select T00794 factor (the human TBP) for inspection

*    Try to understand the information available in each field


    1.3 Recover the weight matrix that represents the TATA box from the database TRANSFAC

*    Go to the TRANSFAC database  (login: ub_2006; password: ub2006)

*    In TRANSFAC 7.0: choose Search action

*    Select the table of Matrix

*    Enter the name TATA

*    In "Table field to search in" select (Factor) Name (NA)

*    There are two entries: M00252 and M00216

*    Select M00252 matrix for inspection

*    Keep this matrix for further use (from AC line to //)


    1.4 Predict the position of the TATA box in the human alpha-actin cardiac gene upstream sequence

*    Open RSA tools webserver

*    On the left frame, click on Pattern matching - patser (matrices)

*    Paste the human alpha-actin cardiac gene upstream sequence

*    Select Transfac as Matrix Format

*    Paste the Transfac TATA matrix (including matrix header) obtained previously

*    In Origin select "start" (of the sequence) and press GO

*    Check the results.  They should look something like this:

map                type  id      strand  start  end  sequence                 score  ln(P)
human_-360_to_-15  site  patser  D       46     60   caccCCAGAAAGGGGGAGGggtg  3.29   -5.74
human_-360_to_-15  site  patser  R       120    134  tcccCAGCTCCCTATTTGGccat  1.41   -4.61
human_-360_to_-15  site  patser  D       127    141  ctccCTATTTGGCCATCCCcctg  0.46   -4.13
human_-360_to_-15  site  patser  R       157    171  cctcCCCTTCCTTACATGGtctg  3.78   -6.07
human_-360_to_-15  site  patser  D       213    227  ggctCCATGAATGGCCTCGgcag  4.38   -6.51
human_-360_to_-15  site  patser  D       253    267  gggaCCAAATAAGGCAAGGtggc  3.00   -5.54
human_-360_to_-15  site  patser  R       323    337  ccatCAGCGTTCTATAAAGcggc  4.50   -6.60
human_-360_to_-15  site  patser  D       328    342  agcgTTCTATAAAGCGGCCctcc  0.91   -4.35
human_-360_to_-15  site  patser  D       330    344  cgttCTATAAAGCGGCCCTcc    7.67   -9.65


*   Each row in the results table is a putative match to the TATA box matrix. Which one is most likely to be the real TATA box?

*   To obtain a graphical representation of predictions press "feature map"

*   In the RSA-tools - feature map page press "GO"

*   Identify the best TATA box prediction in the drawing
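The idea behind a matrix scan can be sketched in a few lines: slide a window along the sequence, score every window on both strands, and rank the hits. The code below is only an illustration under stated assumptions. It uses the 6-position example matrix from the introduction rather than the 15-position TRANSFAC matrix M00252, and a simple log-odds score with a pseudocount of 0.5 and a uniform background, which is not necessarily how patser computes its scores. Function names are hypothetical.

```python
import math

# Count matrix of the 6-position TATA example from the introduction
# (NOT the TRANSFAC matrix M00252 used in the exercise).
COUNTS = {
    "A": [0, 8, 0, 8, 7, 7],
    "C": [0, 0, 0, 0, 0, 0],
    "G": [1, 0, 0, 0, 1, 0],
    "T": [7, 0, 8, 0, 0, 1],
}
WIDTH = 6
PSEUDOCOUNT = 0.5   # assumption: avoids log(0) for unseen bases
BACKGROUND = 0.25   # assumption: uniform base composition


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def window_score(window):
    """Log-odds score (base 2) of one window against the matrix."""
    n_sites = sum(COUNTS[base][0] for base in "ACGT") + 4 * PSEUDOCOUNT
    score = 0.0
    for j, base in enumerate(window):
        freq = (COUNTS[base][j] + PSEUDOCOUNT) / n_sites
        score += math.log2(freq / BACKGROUND)
    return score


def scan(seq):
    """Score every window on the direct (D) and reverse (R) strand.

    Returns (score, strand, start, site) tuples sorted from best to
    worst; start is 1-based on the direct strand.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - WIDTH + 1):
        window = seq[start:start + WIDTH]
        if any(base not in COUNTS for base in window):
            continue                    # skip windows containing X or N
        hits.append((window_score(window), "D", start + 1, window))
        rc = reverse_complement(window)
        hits.append((window_score(rc), "R", start + 1, rc))
    return sorted(hits, reverse=True)


# Last line of the human upstream sequence given above
fragment = "CTGACCCTGTCCATCAGCGTTCTATAAAGCGGCCCTCCT"
for score, strand, start, site in scan(fragment)[:3]:
    print(strand, start, site, round(score, 2))
```

Most windows receive strongly negative scores because they contain bases never observed in the aligned sites, which is why the results table above lists only a handful of positions.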


    2. Run predictions of all putative transcription factor binding sites in the human alpha-actin cardiac gene

*   Go to PROMO (Select RESEARCH and then PROMO 3.0).

*   Go to SearchSites and input the human alpha-actin cardiac gene upstream sequence.

*   Click Submit.

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Go back and go to SelectSpecies in the menu at the left. Choose "only human factors" and "only human sites". Click Submit.

*   Go to SearchSites and input the human alpha-actin cardiac gene upstream sequence.

*   Click Submit.

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Click on some of the factors below "Factors predicted within a dissimilarity margin less or equal than 15%"

*   Check the dissimilarity and RE (random expectation) values of "TBP" (TATA-binding protein) by clicking on the colored box (number 33).

*   Click on Zoom. How many predictions do we have for "TBP"? Can we trust them?


    3. Run predictions of all putative transcription factor binding sites in the four vertebrate alpha-actin cardiac genes

*   Go to PROMO (Select RESEARCH and then PROMO 3.0).

*   Go to SelectSpecies in the menu at the left.

*   Select "Selected species factors" and "Selected species sites".

*   We will select the phylogenetic group "chordata". To do so we click below on the arrow for "all species", then we select "eukaryota", then "animals", then "chordata". We perform this for "Factors of" and "Sites of".

*   Go to MultiSearchSites in the menu at the left.

*   Paste the 4 sequences of the cardiac alpha-actin gene. Select "Sites found in 1 or more sequences" to see all the predictions. Click Submit.

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   As these are orthologous sequences it is likely that functional sites are shared. Go back and select "Sites found in all sequences".

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Go back and change the dissimilarity cut-off to 5 (the maximum percent dissimilarity allowed).

*   Inspect the putative TFBSs. How many different TFBS do we find?

*   Click on some of the factors below "Factors predicted in 4 or more input sequences within a dissimilarity margin less or equal than 5 %".

*   Compare the RE values of SRF (box number 32), which are very low, with those of GR-beta (box number 1), which are much higher.

*   Annotate the approximate position of the SRF predictions on the sequences.


    4. Check the experimentally-validated sites on these sequences

*   Search for known TFBS in these sequences in the Catalog of Muscle-specific Regulatory Elements.

*   Go to Table of Contents.

*   Go to Actin, Alpha-Cardiac.

*   Go to "Transcription factor binding sites".

*   Go down the page to see binding sites in the four alpha-actin cardiac gene upstream sequences.

*   Note that SRF is conserved in the 4 sequences. Compare with the results obtained in the previous sections with TFBS prediction programs.



Additional files:

TATA matrix from TRANSFAC

PROMO chordata tree