SCANNER: TRANSCRIPTION FACTOR MOTIF MULTIALIGNMENT DOCUMENTATION.

Options.

Scanner offers the user the following options:

-h: to receive information about the usage of the program. Type 1 after "-h" for information in Spanish, 2 for Catalan, 3 for English. This option must be followed by a number.

-v: to receive information about the program status while it's running.

-m: to receive information about the name, dimensions and consensus sequence of each matrix.

-s: to receive information about the name, length and G+C content of each sequence.
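
As an illustration of the kind of summary the -s option describes, here is a minimal Python sketch (not the program's own Perl code). The function name sequence_summary is hypothetical, and reporting G+C as a fraction of the sequence length is an assumption: the program may print a count or a percentage instead.

```python
def sequence_summary(name, sequence):
    """Return (name, length, G+C fraction) for one sequence."""
    seq = sequence.upper()
    length = len(seq)
    gc = sum(1 for base in seq if base in "GC")
    # Assumption: G+C is expressed as a fraction of the sequence length.
    gc_fraction = gc / length if length else 0.0
    return name, length, gc_fraction


# "name of seq 3" from the FASTA example in this document, lines joined.
name, length, gc_fraction = sequence_summary(
    "name of seq 3", "TAGTACTCGTAACGATAGCTACGACTAGAGCGCG"
)
print(f"{name}\tlength={length}\tG+C={gc_fraction:.3f}")
```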

Requirements.

The input files must follow some basic structural requirements.
Example of sequences in FASTA format:
>name of seq 1
GTACGTACGTATCGTACGATCGATGC
TATAAGCATGCTAGCTGCAGCAGCAC
CTACGAT
>name of seq 2
CACTGCTAGCTTACGACGTCGATTAT
ACAGTGGCATCACTACGCGTAC
>name of seq 3
TAGTACTCGTAACGATAGCTACGACT
AGAGCGCG

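Example of matrices in a compatible format. Each matrix starts with a "#" line giving its name; each following row holds the position, four nucleotide values (columns A, C, G and T, judging by the consensus symbols) and the consensus symbol for that position: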
# TATA box
01 61 145 152 31 S
02 16 46 18 309 T
03 352 0 2 35 A
04 3 10 2 374 T
05 354 0 5 30 A
06 268 0 0 121 A
07 360 3 20 6 A
08 222 2 44 121 W
09 155 44 157 33 R
10 56 135 150 48 N
11 83 147 128 31 N
12 82 127 128 52 N
13 82 118 128 61 N
14 68 107 139 75 N
15 77 101 140 71 N
# GC box
01 102 40 50 82 N
02 97 31 112 34 R
03 50 6 154 64 G
04 67 1 206 0 G
05 0 0 274 0 G
06 2 0 272 0 G
07 54 170 0 50 C
08 46 1 224 3 G
09 1 3 222 48 G
10 79 0 171 24 G
11 23 17 192 42 G
12 0 166 35 73 C
13 20 86 52 116 K
14 40 24 109 101 K

Programming details.

This program identifies transcription factor binding sites using weight matrices.
- BASIC INSTRUCTIONS:
Read the options described above before using the program.
You must download it.
You must install it on an OS compatible with Perl.
How does it work? There are several steps:

Open the sequence file. (1)
- This file is split sequence by sequence into an array.
- Every element of this array (a sequence) is processed:

- The name is deleted (it is saved in another array, which accumulates all the names).
- The sequence is split line by line and all the elements of the resulting array are joined again (this removes the line-break characters).
- Every rejoined sequence becomes an element of a new array that is used for scoring.
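
The sequence-file step above can be sketched in Python as follows. This is an illustration only, not the program's Perl source; the function name read_sequences is hypothetical, and the input is assumed to be well-formed FASTA text in which every record starts with ">".

```python
def read_sequences(text):
    """Split FASTA-formatted text into parallel lists of names and sequences."""
    names, sequences = [], []
    # Every record starts with '>', so splitting on it gives one chunk per sequence.
    for record in text.split(">"):
        if not record.strip():
            continue  # skip the empty chunk before the first '>'
        lines = record.splitlines()
        names.append(lines[0].strip())  # the first line of a record is its name
        # Joining the remaining lines removes the line-break characters.
        sequences.append("".join(line.strip() for line in lines[1:]))
    return names, sequences


example = """>name of seq 1
GTACGTACGTATCGTACGATCGATGC
TATAAGCATGCTAGCTGCAGCAGCAC
CTACGAT
>name of seq 2
CACTGCTAGCTTACGACGTCGATTAT
ACAGTGGCATCACTACGCGTAC
"""
names, sequences = read_sequences(example)
for name, sequence in zip(names, sequences):
    print(name, len(sequence))
```

Keeping the names and the sequences in two parallel lists mirrors the two arrays described in the text.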

Open the matrix file. (2)
- This file is split matrix by matrix into an array.
- Every element of this array (a matrix) is processed:

- The name is deleted (it is saved in another array, which accumulates all the names), as are the positions and the consensus (also saved in a new array).
- The matrix is split row by row, and each row is split again into a new array.
- Every element is converted to a log-likelihood value with the expression:
Log-likelihood value = ln[(a/sum)/p]
a = old matrix value.
sum = sum of all the values in the row.
p = 'a priori' probability of any nucleotide (we take 0.25).
- All these new values are saved in a three-dimensional array (the third dimension identifies the matrix they come from).
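
As a worked example of the expression above, take row 03 of the TATA box matrix (352 0 2 35). The row sum is 389, so the first value becomes ln[(352/389)/0.25], which is approximately 1.286. The Python sketch below applies the same conversion to a whole matrix file. It is an illustration, not the program's Perl source: the function name read_matrices is hypothetical, the nested list is indexed as matrix, row, column (the matrix index comes first rather than third), and mapping a zero value to minus infinity is an assumption, because the text does not say how zeros are handled.

```python
import math

PRIOR = 0.25  # 'a priori' probability of any nucleotide, as stated in the text


def read_matrices(text):
    """Return (names, consensus strings, log-likelihood matrices)."""
    names, consensus, matrices = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A '#' line opens a new matrix and carries its name.
            names.append(line.lstrip("#").strip())
            consensus.append("")
            matrices.append([])
            continue
        fields = line.split()
        counts = [float(value) for value in fields[1:5]]  # drop the position
        consensus[-1] += fields[5]                        # keep the consensus symbol
        total = sum(counts)
        # ln[(a/sum)/p]; a zero value has no finite logarithm, so it is
        # mapped to -inf here (assumption, not documented behaviour).
        matrices[-1].append(
            [math.log((a / total) / PRIOR) if a > 0 else float("-inf") for a in counts]
        )
    return names, consensus, matrices


# First three rows of the TATA box matrix from this document.
example = """# TATA box
01 61 145 152 31 S
02 16 46 18 309 T
03 352 0 2 35 A
"""
names, consensus, matrices = read_matrices(example)
print(names[0], consensus[0], round(matrices[0][2][0], 3))
```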

Sequence candidates scoring. (3)
- We use the final sequences array built in the first step.
- We split the first element of this array (the first sequence) nucleotide by nucleotide.
- Comparison to the matrix:
- The program places a sliding window at the first position of the sequence. The width of this sliding window is the number of rows in the matrix.
- Every position of the sequence included in the current window is compared with the same matrix position, and a log-likelihood score is assigned to the position according to its nucleotide.
- The total window score is the sum of the single position scores.
- Sequence candidates filtering: (4) if the current window score is higher than the threshold, the content of the window is considered a binding site candidate.
- This process is repeated, moving the sliding window one position at a time (5), until no further window fits on the sequence.
- This comparison is repeated for all the matrices (6) on each sequence (7).
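
The sliding-window scoring and filtering can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's Perl source: the function name scan is hypothetical, the matrix columns are assumed to be in the order A, C, G, T, windows containing any other symbol are skipped, and the threshold is passed in by the caller because the text does not say how it is chosen. The toy matrix is invented for the demonstration and is not one of the matrices in this document.

```python
def scan(sequence, matrix, threshold):
    """Slide a window over `sequence`; return (start, site, score) candidates."""
    column = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed column order
    width = len(matrix)  # window width = number of rows in the matrix
    hits = []
    # Move the window one position at a time until it no longer fits.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        if any(base not in column for base in window):
            continue  # assumption: skip windows with symbols other than A, C, G, T
        # The window score is the sum of the per-position log-likelihood scores.
        score = sum(matrix[i][column[base]] for i, base in enumerate(window))
        if score > threshold:
            hits.append((start, window, score))
    return hits


# Toy two-row log-likelihood matrix that favours the site "AT".
toy_matrix = [
    [1.0, -1.0, -1.0, -1.0],
    [-1.0, -1.0, -1.0, 1.0],
]
hits = scan("GATTA", toy_matrix, threshold=0.0)
print(hits)
```

Repeating the comparison for all the matrices on each sequence amounts to calling this function inside two nested loops, one over the sequences and one over the matrices.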

Fig.2: Scheme showing the SCANNER working process. For more details, follow the numbers in the text.


By: Guiomar Solanas & Alex Vendrell. Universitat Pompeu Fabra, March 2002.
For further information, send an e-mail to the authors.