Saimiri boliviensis

1. METHODS SUMMARY

Our aim was to search for the already known selenoprotein families in the Saimiri boliviensis genome. Because the synthesis of selenoproteins requires a specific machinery, we also searched for the proteins of the machinery. In both cases, we used an homology-based approach.

In order to find the putative regions coding for selenoproteins and selenoprotein homologues, we used gene prediction based on homology complemented with SECIS element prediction. Because of the phylogenetic proximity between S. boliviensis and Homo sapiens, we used the human selenoproteins as queries to search for regions encoding homologous proteins in the S. boliviensis genome. First, BLAST was used to search for the genomic regions with similarity (after translation) to the human selenoproteins. Hits were then extended, yielding subsequences that were used for gene prediction with exonerate and GeneWise. T-Coffee was then used to align the predictions with their respective queries. The incorporation of Sec into selenoproteins requires a SECIS element. Therefore, in order to find the SECIS elements corresponding to the predicted selenoproteins, we used SECISearch to predict putative SECIS elements. Besides, we also used blast to find sequences homologous to the human SECIS elements. Finally, Selenoprofiles was used to confirm our results.

The search of the machinery was also done using the human machinery proteins as queries and using blast followed by exonerate and GeneWise to predict the homologues. In this case, we did not search for SECIS elements because these proteins do not contain Sec (except for SPS2, which was analysed together with the other selenoproteins).

2. GENOME

The Saimiri boliviensis boliviensis genome (GenBank Assembly ID GCA_000235385.1) was sequenced in 2011 and can be downloaded from Assembly database. It is assembled in scaffolds.

Back to top

3.QUERIES

Because of the phylogenetic proximity and the availability of the sequences, we used the human proteins and SECIS elements as queries for our analysis. We also searched for four extra selenoprotein families which are not found in mammals.

Selenoproteins: we used the human selenoproteins found in SelenoDB, a database specifically dedicated to selenoprotein genes and proteins (human sequences). In order to extend our analysis, we also searched for four selenoprotein families that are not found in humans nor in SelenoDB: Disulfide bond formation protein A (DsbA), from the tunicate Ciona intestinalis (NCBI reference sequence NP_001177288.1), and Fep15 (NP_001182373.1), Selenoprotein L (NP_001177311.1) and Selenoprotein J (NP_001180398.1), from the fish Danio rerio . These queries were downloaded from Protein database (non-human sequences).
SECIS elements: human SECIS elements were downloaded from SelenoDB (SECIS elements).
Machinery: eEFSec, SBP2, SPS1 and SPS2 were downloaded from SelenoDB. Pstk, Secp43 and SecS were downloaded from Protein database (accession numbers NP_699167.2, NP_060316, Q9HD40.2, respectively) (machinery sequences).

4. BLAST: FINDING GENOMIC REGIONS WITH HOMOLOGOUS SEQUENCES

BLAST (Basic Local Alignment Search Tool) is an algorithm that searches for local similarity between a query sequence and a sequence database. There are many programs of BLAST. The selection of a specific program depends on the query, the database and the aim of the search.

In order to find genomic regions encoding proteins homologous to the known selenoproteins and the machinery, we used tblastn. This program searches for similarity between a protein query and a nucleotide database translated into the six reading frames. An important parameter in blast is the expectation value (E-value) of a hit, which reflects the chances of finding the hit by chance. Therefore, the lower the E-value, the better. For the human queries, multiple hits were obtained with a wide range of E-values, including E-values as low as 2x10^-120, consistent with the phylogenetic proximity between humans and S. boliviensis. We used an E-value cut-off of 0.001 (hits with higher e-values were discarded), and all the reported hits were used for further analysis. For the non-human selenoproteins, no E-value cut-off was used.

We also searched for sequences with similarity to the human SECIS elements found in SelenoDB. For this, because they are nucleotide sequences, we used blastn, which searches for similarity between a nucleotide sequence and a nucleotide database. We also used an E-value cut-off of 0.001.

In order to run blast from the shell, a database from the genome has to be created. This had already been created for us, but can be done with the following command line:

formatdb -i genometoformat.fa -p F -n name.fa

-i: genomic sequence to format
-p: type of file: T if protein, F if nucleotide
-n: name we want to give to the database

Blast can be run with the following command line (parenthesis indicate optionality):

blastall -p program -i query.fa -d database.fa (-o blast_output) (-e 0.001) (-m9)

-p: blast program: tblastn or blastn in our case
-i: query
-d: database
-e: E-value cut-off. Only hits with a smaller E-value will be reported.
-m9: to get the results in a tabular format
-o: name of the file with the output

See blast in our script.

5. FASTAINDEX/FASTAFETCH/FASTASUBSEQ: EXTRACTING THE SUBSEQUENCES WITH THE BLAST HITS

After finding the genomic regions with similarity to the protein queries (blast hits), a genomic region was selected for each blast hit so that it included the hit and the surrounding nucleotides. In case we obtained more than one hit on the same scaffold for a given query, we "merged" the hits so that the subsequence included all the hits and the surrounding nucleotides after the first and last hits. This subsequence was then used for gene prediction using exonerate and GeneWise (the subsequence is necessary in order to reduce the computing time, which would be unreasonably high if we run these programs for the whole scaffold).

Extracting a subsequence from a genome sequence can be done with the program fastasubseq, which is part of the exonerate package. First, we need to index the genome and then get the scaffolds which contain the subsequences to be extracted.

We made a program that generates a file with the scaffolds that will be needed (forfastafetch.pl)

The scaffolds can be retrieved in two steps:

fastaindex genome.fa genome.index

fastafetch genome.fa genome.index nameofthescaffold > scaffold.fa

Then we can extract the subsequence. For this, we need the name of the scaffold from which the subsequence must be extracted, the start position and the length of the sequence to extract. H. sapiens and S. boliviensis are closely-related species and the protein sequences are expected to be similar. However, proteins are rarely reported as a unique blast hit because they are usually encoded by more than one exon and different exons are reported as different blast hits. Therefore, for a given query and a given scaffold, the smallest and biggest position in the scaffold for the blast hits must be identified. Because it might be that some exons are less conserved and therefore not reported as blast hits, we wanted to reduce as much as possible the possibility of omitting these sequences from the subsequence. Therefore, we extended 150,000 nucleotides before the smallest position detected by blast and after the biggest one, and extracted the subsequence in between. We made a program that calculated the start position and the length (forfastasubseq.pl).

The subsequences can be extracted with the following command line:

fastasubseq scaffold.fa startposition length > subsequence.fa

See fastaindex, fastafetch and fastasubseq in our script.

6. EXONERATE: GENE PREDICTION (1)

Exonerate is a tool for sequence alignment. In one of its modalities, it aligns a protein sequence to a translated DNA sequence, allowing for introns and therefore predicting the structure of a gene encoding a similar protein. We used exonerate to align the protein queries to the DNA subsequences extracted in the previous step, so that the similar proteins encoded in the genome were predicted.

Exonerate can be run using the following command line:

exonerate -m p2g --showtargetgff -q proteinquery.fa -t subsequence.fa > output.gff

-m p2g: mode protein to genome
--showtargetgff: includes the gene prediction in GFF format. We used this output for further analysis.
-q: query (protein)
-t: subsequence target DNA

Then, the GFF output is used to get the sequence of the predicted cDNA in fasta format, which will be then translated.

For this, first a file is made with the information for the exons:

cat output.gff | egrep -w exon > output2.gff

Then the fasta sequence is obtained with the program fastaseqfromGFF.pl:

fastaseqfromGFF.pl subsequence.fa output2.gff > output.fa

Finally, fastatranslate, one of the programs included in the exonerate package, is used to translate the cDNA sequence.

fastatranslate -F 1 output.fa > translated.fa

-F: 1-6. This indicates the frame in which the sequence will be translated (frames 1 to 6). Because the cDNA sequence should start with the first codon, the frame +1 should be the appropriate.

See exonerate in our script.

7. GENEWISE: GENE PREDICTION (2)

GeneWise uses a combination of gene prediction model and a protein homology model to compare a protein sequence to a DNA sequence and, whenever existing, predict the structure of the gene (introns and exons) that encodes for a similar protein. It is similar to exonerate but it does not use heuristics (Birney et al, 2004).

Genewise can be run with the following command line:

genewise -pep -pretty -cdna -gff (-trev) proteinquery.fa subsequence.fa > output.gff

-pep: to show the protein translation
-pretty: to show the pretty ASCII alignment
-cdna: to show the predicted cDNA
-gff: to show the gene prediction in GFF format
-trev: to indicate that it must search the reverse strand. Alternatively, we could use the option -both, to search both strands.

See GeneWise in our script.

8. T-COFFEE: ALIGNING THE PREDICTED SEQUENCES TO THE QUERIES

After predicting the similar proteins with exonerate and GeneWise, each sequence is aligned to its respective query with T-Coffee. This is a multiple sequence alignment package, which, therefore, allows to align two or more sequences.

T-Coffee can be run with the following command line:

t_coffee sequence1.fa sequence2.fa

See T-Coffee in our script: for the exonerate predictions and for the GeneWise predictions.

9. SECISearch: SECIS PREDICTION

The recoding of the UGA codon into Sec requires a SECIS element in the 3'UTR of the mRNA. Therefore, to further characterize the predicted proteins belonging to selenoprotein families (not for the machinery, except for SPS2), we searched for SECIS elements around the predicted proteins. For this, we used SECISearch and blast.

SECIS elements have conserved features in the primary and secondary structure. SECISearch is a program that predicts SECIS elements based on this conservation, together with an estimation of the free energy of the predicted secondary structure, which reflects the stability of the structure (Kryukov et al, 2003).

SECISearch can be run on the command line for any sequence for which SECIS elements must be predicted. We used the same subsequences as for exonerate and GeneWise.

SECISearch.pl sequence

The conservation exhibited by SECIS elements is quite loose, making SECIS prediction challenging. SECISearch predictions include a number of false positives, and some SECIS elements might be missed. It is known that SECIS elements of mammalian orthologous selenoproteins exhibit sequence similarity (Kryukov et al, 2003). Therefore, to complement the SECIS prediction, we used blast to identify genomic regions with similarity to the human SECIS elements, as previously explained.

See SECISearch in our script.

10. SELENOPROFILES

Selenoprofiles is a computational pipeline to identify all members of a protein family encoded in a target genome sequence. It was developed to identify selenoproteins. It uses a similar approach to ours. The main difference is that the initial queries are alignments of selenoprotein families instead of a single protein. From this alignments, a profile is built and used for finding homologous regions with psitblastn (this modality of blast compares a profile (position specific scoring matrix) to a nucleotide database translated into the six frames). Subsequently, exonerate and GeneWise are used to predict candidate genes. The parameters used for running these programs are slighlty different to ours. For instance, the scoring matrix is modified so that the extension of the alignment over a TGA codon in the target genome is favoured when it aligns with selenocysteine, cysteine, arginine and threonine. Predicted candidates are then filtered on the basis of a minimum spanning length in relation to the profile alignment. SECISearch is then used to search for SECIS elements. Finally, results are refiltered (Mariotti et al, 2010).

We executed Selenoprofiles with the following command line. We searched for all the eukaryotic selenoproteins.

Selenoprofiles results_folder -t /cursos/BI/genomes/project_2013/Saimiri_boliviensis/genome.fa -s "Saimiri_boliviensis" -p eukaryotic

11. SCRIPTS

Here we show a script, which we call script1, which allows to automatically perform all the steps for a given set of protein queries (e.g. selenoproteins, machinery). This script uses three perl programs: extractu.pl, forfastafetch.pl, forfastasubseq.pl.

Script1.sh

#!/bin/bash


### First we have already created a folder named proteins, which contains the file with the 
queries (humanselenoprot_prot.fa) and the program extractu.pl.

### We also have a folder named fastafetch, with the programs forfastafetch.pl and forfastasubseq.pl.

### BLAST. We start in our "home" folder. We create a folder named blast. 
We copy the file containing the queries (humanselenoprot_prot.fa) and run tblastn in it. We set 
the-m9 option and filter out the hits with an e-value smaller than 0.001.

mkdir blast

cd blast

cp ../proteins/humanselenoprot_prot.fa .

export PATH=/cursos/BI/bin/ncbiblast/bin:$PATH #In order to run the different programs of the script, we need to export their paths.
blastall -p tblastn -i humanselenoprot_prot.fa -d /cursos/BI/genomes/project_2013/Saimiri_boliviensis/genome.fa -m9 -e 0.001 -o blast

### We make a file without the lines starting with "#".

egrep -v "#" blast > ../inputfromblast.txt

### Next, Us in protein sequences are replaced by Xs.  For this, the programs extractu.pl
is required in the proteins folder.Then, individual protein sequences are extracted into indindividual files.
We do this with the programs fastaindex and fastafetch, as we will do for the genomic sequences.

cd ../proteins
./extractu.pl < humanselenoprot_prot.fa > protnou.fa

export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH

fastaindex protnou.fa protnou.index

for protein in `cut -f 1 ../inputfromblast.txt | uniq` ; do {
fastafetch protnou.fa protnou.index $protein > "${protein}.fa"
}
done



### In order to run fastafetch, first we use the program forfastafetch.pl to get a file with the queries
and the names of the scaffolds we want to retrieve.

cd ../fastafetch
./forfastafetch.pl < ../inputfromblast.txt > protein_genomic.txt

cut -f 2 protein_genomic.txt > inputforfastafetch.txt #This makes a file with the names of the 
scaffolds.

### Then we run FASTAFETCH (first fastaindex).

export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH
fastaindex /cursos/BI/genomes/project_2013/Saimiri_boliviensis/genome.fa genome.index

for name in `cat inputforfastafetch.txt`; do {

fastafetch /cursos/BI/genomes/project_2013/Saimiri_boliviensis/genome.fa genome.index $name > "${name}.fa"

} done  

### We change the names of the scaffolds to make the handling of the files easier. Each scaffold
is named with the name of the protein and the hit. When two or more proteins have hits on the 
same scaffold, a copy is generated for each protein.

nom=a
while read protein genome
do
  
if [ "$nom" != "$protein" ]; then
		num=1
		nom=$protein
		cp "${genome}.fa" "${protein}_${num}_genomic.fa"
      else
		num=$[$num+1]
		cp "${genome}.fa" "${protein}_${num}_genomic.fa"
fi
done < protein_genomic.txt

### Then we generate a file with the information we need for running fastasubseq with the program
forfastasubseq.pl. It is an extension of the program forfastafetch.pl and in this case the name 
of the scaffold is the new name.

./forfastasubseq.pl < ../inputfromblast.txt > inputforfastasubseq.txt

### Then we run fastasubseq. First we make a folder named subsequences. 

mkdir ../subsequences

while read query genomic start length orientation
do

fastasubseq $genomic $start $length > ../subsequences/"${genomic}_exonerate.fa" 2> error.temp
error=`cat error.temp` #this is to ammend those cases in which the length goes beyond the end of the scaffold, which gives an error.
if [ "$error" != "" ]; then
   final=`cat error.temp | egrep "before end" | cut -d '(' -f 2 | sed 's/)//'`
   length=$(( $final - $start))
   fastasubseq $genomic $start $length > ../subsequences/"${genomic}_exonerate.fa" 
fi
done < inputforfastasubseq.txt


### Then we run exonerate (from home).

cd ..
mkdir outputexoneratealign
mkdir outputexonerate
mkdir multiplepredictions

export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH

export PATH=/cursos/BI/bin:$PATH

while read query genomic start length orientation
do

genomicexonerate="${genomic}_exonerate.fa"

exonerate -m p2g --showtargetgff -q ./proteins/$query -t ./subsequences/$genomicexonerate > outputexoneratealign/"${genomicexonerate}.gff"

done  < ./fastafetch/inputforfastasubseq.txt

###Then we transform the exonerate output into fasta and translate the sequence. For that, 
we need to take into account that exonerate might give more than one prediction. In this
case, we only consider those predictions in the same orientation as the blast hit.

ls outputexoneratealign > listexonerate.txt

for name in `cat listexonerate.txt`; do {
subsequence=`echo $name | sed 's/\.gff//'` 
   
numprediction=`egrep "C4 Alignment" ./outputexoneratealign/$name | uniq -c | cut -d "C" -f 1`

if [ $numprediction == 1 ]; then 
	cat ./outputexoneratealign/$name | egrep -w exon > ./outputexonerate/"${name}.gff"

	fastaseqfromGFF.pl ./subsequences/$subsequence ./outputexonerate/"${name}.gff" > ./outputexonerate/"${name}fromgff.fa"  

else
	i=0
	region1=no
	region2=no
	while read file #read the file line by line
	do
  	if [ $region1 = "no" ]; then
      		field=`echo $file | cut -d ":" -f 1` 

      		if [ "$field" = "Target range" ]; then #here is where exonerate gives the start and ending position of the prediction
		    start="`echo $file | cut -d ":" -f 2 | cut -d "-" -f 1 | cut -d " " -f 2`" 			
		    end="`echo $file | cut -d ">" -f 2 | cut -d " " -f 2`" 
	    
	  		if [ "$start" -lt "$end" ]; then
	  		    orientation="+"

	  		else
	  		    orientation="-"

	  		fi

			hit=`basename $name _exonerate.fa.gff`; 
			orientationblast=`cat ./fastafetch/inputforfastasubseq.txt | egrep "$hit" | cut -f 5`
			echo orientationblast $orientationblast
	  		if [ "$orientationblast" = "$orientation" ]; then
             		     #then do the other steps
	     			region1=yes

	  		fi
		fi
 	fi
	if [ "$region1" = "yes" ]; then
		    
	     	if [ "$region2" = "no" ]; then
			 compare=`echo $file | cut -c 7-11`
			 
			if [ "$compare" = "START" ]; then
		     		i=$[$i + 1]
		     		region2=yes
		 	fi
	      
	     	fi

	     	if [ "$region2" = "yes" ]; then

			 echo "$file" >> ./multiplepredictions/"${name}${i}"
 
			 compare=`echo $file | cut -c 7-9`
     
	  		 if [ "$compare" = "END" ]; then
		    		 region2=no
		     		region1=no
   
			 fi  

	     	 fi 
     
	fi       
	done < ./outputexoneratealign/$name

fi
    
  
}

done

ls multiplepredictions > listgff.txt 

for file in `cat listgff.txt` ; do {

subsequence=`echo $file | sed 's/\.gff[0-9]//'` 


cat ./multiplepredictions/$file | egrep -w exon > ./outputexonerate/"${file}.gff"

fastaseqfromGFF.pl ./subsequences/$subsequence ./outputexonerate/"${file}.gff" > ./outputexonerate/"${file}fromgff.fa"  

} done  

# We make a file with the names of the fasta files generated in the previous step. These files 
contain the exonerate predictions that will be translated next.

ls outputexonerate | egrep "fromgff" > ./outputexonerate/totranslate.txt

mkdir translated

cd outputexonerate

export PATH=/cursos/BI/soft/exonerate/i386/bin:$PATH

for sequence in `cat totranslate.txt` ; do {
i=1
while [ $i -lt 4 ] #although we expect the good translation will be for the frame 1, we also consider the frames 2 and 3
do
fastatranslate  -F $i $sequence > ../translated/"${sequence}_${i}_translated.fa"
i=$[i+1]
done
}
done 


### T-COFFEE. Exonerate predictions and queries are aligned using T-COFFEE. 


cd ..
mkdir tcoffeeexonerate
cd tcoffeeexonerate
ls ../translated > namestcoffeeexonerate.txt

export PATH=/cursos/BI/bin:$PATH 

for sequence in `cat namestcoffeeexonerate.txt` ; do {

protein=`echo $sequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa\.gff[0-9]*fromgff\.fa_[0-9]_translated\.fa//'
`

t_coffee ../translated/$sequence ../proteins/"${protein}.fa"
} done

cd ..

### SECISEARCH

mkdir secis
cd secis
ls ../subsequences > subsequences.txt

export PATH=/cursos/BI/bin:$PATH
while read subseq
do
SECISearch.pl ../subsequences/$subseq
done < subsequences.txt

cd ..

### GENEWISE preparation.

ls subsequences > subsequencesgenewise.txt

mkdir genewise

cd genewise

mkdir outputgenewise
mkdir outputgenewiserev

###First we run genewise for the forward strand.

for subsequence in `cat ../subsequencesgenewise.txt`; do {
query=`echo $subsequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa//'`
genewise -pep -pretty -cdna -gff ../proteins/${query}.fa ../subsequences/$subsequence > ./outputgenewise/"${subsequence}_genewise.gff"
} done

###Then we run genewise for the reverse strand.

for subsequence in `cat ../subsequencesgenewise.txt`; do {
query=`echo $subsequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa//'`
genewise -pep -pretty -cdna -trev -gff ../proteins/${query}.fa ../subsequences/$subsequence > ./outputgenewiserev/"${subsequence}_genewiserev.gff"
} done


###Then we need to get the protein prediction from the genewise output in order to align it with
the query using T-Coffee.

ls outputgenewise | egrep "genewise.gff" > namesgenewise.txt
ls outputgenewiserev | egrep "genewiserev.gff" > namesgenewiserev.txt

cd outputgenewise
cp ../namesgenewise.txt .

for name in `cat namesgenewise.txt` ; do {
fet=false
region=no
while read file 
do
if [ "$fet" = "false" ]; then 

    if [ $region = "no" ]; then
	compare="`echo $file | cut -c 1-3`"
 
	if [ "$compare" = ">gi" ]; then
	    region=yes
 	    i=0
	fi

    else

	compare="`echo $file | cut -c 1-2`"
	if [ $compare != "//" ]; then
	    i=$[i + 1]
	else
 
	  fet=true
	fi

    fi  
fi
done < $name

cat $name | egrep "pep" -A $i > ${name}_fortcoffee.fa

} done

cd ../outputgenewiserev
cp ../namesgenewiserev.txt .

for name in `cat namesgenewiserev.txt` ; do {
fet=false
region=no
while read file
do
if [ "$fet" = "false" ]; then

    if [ $region = "no" ]; then
        compare="`echo $file | cut -c 1-3`"

        if [ "$compare" = ">gi" ]; then
            region=yes
            i=0
        fi

    else

        comparar="`echo $file | cut -c 1-2`"
        if [ $comparar != "//" ]; then
            i=$[i + 1]
        else

          fet=true
        fi

    fi
fi
done < $name

cat $name | egrep "pep" -A $i > ${name}_fortcoffee.fa

} done

cd ..


mkdir tcoffeegenewise
mkdir tcoffeegenewiserev

ls outputgenewise | egrep "fortcoffee.fa" > ./tcoffeegenewise/namestcoffee.txt
ls outputgenewiserev | egrep "fortcoffee.fa" > ./tcoffeegenewiserev/namestcoffeerev.txt

cd tcoffeegenewise

export PATH=/cursos/BI/bin:$PATH 

for sequence in `cat namestcoffee.txt` ; do {

protein=`echo $sequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa_genewise\.gff_fortcoffee\.fa//'`

t_coffee ../outputgenewise/$sequence ../../proteins/"${protein}.fa"
} done

cd ../tcoffeegenewiserev

for sequence in `cat namestcoffeerev.txt` ; do {

protein=`echo $sequence | sed 's/_[0-9][0-9]*_genomic\.fa_exonerate\.fa_genewiserev\.gff_fortcoffee\.fa//'`

t_coffee ../outputgenewiserev/$sequence ../../proteins/"${protein}.fa"
} done

extractu.pl

#!/usr/bin/perl -w

use strict;

#This program is used to substitute the seleneocysteines (U) in the protein queries by X.

my @line;
my $i;
my $newline;


while (<STDIN>){
chomp($_);
$i = 0;
	@line = split ("", $_);
		if ($line[$i] eq ">"){
		$newline = join ("", @line);
		}
		else {
		 while ($i < scalar(@line)){
		 if ($line[$i] eq "U"){

			$line[$i] = "X";
		 }
		 $i = $i + 1;
		 }
		$newline = join ("", @line);	
}
print "$newline\n";
}

forfastafetch.pl

#!/usr/bin/perl -w

use strict;

#this program generates a file with the names of the queries and the scaffolds for which we got 
blast hits. 


     
my @linia;
my $a;
my $human;#protein name
my $nom;#scaffold name
my @output;
my $humancomparacio;

$human = "a";
$nom = "a";
$inici = 1;
$fi = 1;
$humancomparacio = "a";

while (<STDIN>){ 
chomp($_);
$a = $_;
@linia = split ("\t", $a);
if ($linia[0] ne $humancomparacio) {#first line or new protein
					if ($humancomparacio ne "a") {#if not first line
					    
					    @output = join("\t",$human, $nom);
					    print "@output\n";
					}
		      $humancomparacio=$linia[0];
		      $human = $linia[0];
		      $nom = $linia[1];
		
		
		      
				    }
else{#it's still the same protein
	 if ($linia[1] ne $nom){#new scaffold
				    @output = join("\t", $human, $nom);
				    print "@output\n";

				    $human = $linia[0];
				    $nom = $linia [1];

				}
	 

	}
}

@output = join("\t", $human, $nom);
print "@output\n";

forfastasubseq.pl

#!/usr/bin/perl -w

use strict;

#This program uses the blast tabular output (without the lines starting by #) and produces a file
with five columns:query name, scaffold name, start for fastasubseq, length for fastasubseq,
gene orientation. 


my @linia;
my $a;
my $human;#protein name
my $nom;#scaffold name
my $inici;#start
my $fi;#end
my @output;
my $length;
my $numero;#scaffold counter
my $humancomparacio;#variable with the name of the protein query. It is used to know whether 
every line refers to the same protein as the previous line or a new one.
my $orientacio;#orientation
$human = "a";
$nom = "a";
$inici = 1;
$fi = 1;
$humancomparacio = "a";
while (<STDIN>){ 
chomp($_);
$a = $_;
@linia = split ("\t", $a);
if ($linia[0] ne $humancomparacio) {#the first file line or new protein
					if ($humancomparacio ne "a"){ #unless it is the first line
					    $length = ($fi - $inici + 300000);
					    $inici = ($inici - 150000);
					    if ($inici < 0) { $inici = 1;}
					    $nom = $human."_".$numero."_"."genomic.fa";
					    $human = $human.".fa";
					    
					    @output = join("\t", $human, $nom, $inici, $length, $orientacio);
					    print "@output\n";
					}
		      $numero = 1;
		      $human = $linia[0];
		      $humancomparacio = $linia[0];
		      $nom = $linia[1];
		      if ($linia[8] < $linia[9]){
			  $inici = $linia[8];
			  $fi = $linia[9];
			  $orientacio = "+";
		      }
		      else {
			  $inici = $linia[9];
			  $fi = $linia[8];
			  $orientacio = "-";
		      }
		      
		  }
else{#the protein is not new

	 if ($linia[1] ne $nom) {#new scaffold of the same protein
					#then we print what we have from the previous scaffolds
					$length = ($fi - $inici + 300000);
					$inici = ($inici - 150000);
					if ($inici < 0) { $inici = 1;}
					$nom = $human."_".$numero."_"."genomic.fa";
					$human = $human.".fa";

					@output = join("\t", $human, $nom, $inici, $length, $orientacio);
					print "@output\n";
					$numero = $numero + 1;#we need to add 1 because it is a new scaffold
					$human = $linia[0];
					$nom = $linia[1];
					if ($linia[8] < $linia[9]){
					    $inici = $linia[8];
					    $fi = $linia[9];
					    $orientacio = "+";
					}
					else {
					    $inici = $linia[9];
					    $fi = $linia[8];
					    $orientacio = "-";
					}
	}
	 if ($linia[1] eq $nom){ #the protein and the scaffold are the same, there is no need to add anything else
    if ($linia[8] < $linia[9]) {#forward
	if ($linia[8] < $inici) {#substitute the start value if it is smaller
	    $inici = $linia[8];
	}
	if ($linia[9] > $fi){#substitute the end value if it is greater
	    $fi = $linia[9]; 
	}
    }
    else #reverse
    {
	if ($linia[9] < $inici){
	    $inici = $linia[9];
	}
	if ($linia[8] > $fi){
	    $fi = $linia[8];

	}
    }
 }
   
}
}
$length = ($fi - $inici + 300000);
$inici = ($inici - 150000); if ($inici < 0) {$inici = 1;}
$nom = $human."_".$numero."_"."genomic.fa";
$human = $human.".fa";

@output = join("\t", $human, $nom, $inici, $length, $orientacio);
print "@output\n";

#print "$human","$nom","$inici","$fi","\n";

Contents

Script1.sh