1. Introduction
The UGA codon was long thought to have a single meaning: the STOP signal of the genetic code. After the discovery of selenoproteins, a second meaning was attributed to this codon, since it can also direct the incorporation of an additional amino acid, selenocysteine (Sec). Selenocysteine is a cysteine analogue in which selenium replaces sulphur. For selenocysteine to be incorporated, the mRNA must carry a Sec insertion sequence (SECIS) in its 3' untranslated region (UTR). This sequence forms a stem-loop that recruits the factors and enzymes required to insert Sec into the nascent polypeptide chain. The functional importance of these proteins remains poorly understood, but selenocysteine is often located in the enzyme active site and is usually essential for activity. This makes the search for and characterization of these proteins valuable not only in humans but also in other organisms. Twenty-five selenoproteins have been described so far in humans, but we still do not know whether similar proteins are present in insects whose genome sequencing is at an advanced stage or already complete. In this project we report the presence of proteins similar to the human selenoproteins in several organisms belonging to the phylum Arthropoda. We have focused on organisms whose genomes are sufficiently sequenced, such as Anopheles gambiae, Apis mellifera and Drosophila pseudoobscura, and on others, such as Drosophila melanogaster, whose genome sequence is complete. We performed a similarity search (tBLASTn) followed by exon structure analysis (exonerate), SECIS structure prediction (SECISearch) and multiple alignment of our hits (ClustalW). The BLAST hits, which showed a range of scores and conservation levels, were submitted to exonerate to analyse the exon structure of our putative proteins, paying special attention to the residue aligned with selenocysteine (shown as a STOP) and to its neighbouring residues.
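The signal this analysis looks for, a TGA that sits inside the reading frame instead of closing it, can be illustrated with a minimal sketch. The function name and the toy sequence are hypothetical, and the code assumes the input is already in frame; it is not how exonerate works internally.

```python
from __future__ import annotations


def in_frame_tga_positions(cds: str) -> list[int]:
    """Return 0-based codon indices of in-frame TGA codons,
    ignoring the final codon (a terminal TGA is an ordinary STOP)."""
    # Accept RNA as well as DNA by mapping U to T.
    cds = cds.upper().replace("U", "T")
    # Split into complete codons, dropping any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Only internal codons can be Sec candidates.
    return [i for i, codon in enumerate(codons[:-1]) if codon == "TGA"]


if __name__ == "__main__":
    toy_cds = "ATGGCTTGAGGTTGCTAA"  # hypothetical: ATG GCT TGA GGT TGC TAA
    print(in_frame_tga_positions(toy_cds))
```

An internal TGA found this way is only a candidate: it still needs a SECIS element downstream before it can be read as a possible selenocysteine codon.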
An in-frame TGA (STOP) located between other coding triplets, rather than at the end of the reading frame, may indicate a UGA coding for selenocysteine. SECISearch allowed us to identify putative SECIS elements, either to support the evidence from the exonerate analysis or to discard hits that, despite having a TGA inside an exon, lack the SECIS element needed to be interpreted as possible selenoproteins. Finally, the multiple alignment was used to assess the level of conservation among our putative proteins, since high conservation suggests that the residues belong to a functional protein and are under constraint against mutations that would impair its activity.
2. Methods
SELENOPROTEINS AND ORGANISMS
Species from the genus Drosophila:
Other species:
PROGRAMMES
SIMILARITY SEARCH
TBLASTN
GENE PREDICTION
We first planned to use Geneid for gene prediction. However, this proved ineffective in our case, since our target insect genomes were not available for it. We therefore needed programs that do not depend on built-in databases. Two options were discussed, Exonerate and Genewise; following our tutor's advice, we chose the former.
EXONERATE protein2genome 0.8.2
MULTIPLE ALIGNMENT
FASTACHUNK
TRANSLATE TOOL
CLUSTALW
Results are given for each protein in the following order:
SelH
tblastn
>gnl|Dana-agencourt-040714-asm|contig_3129 1e-07 37%
>gnl|Dyak-washu-assembly_040407|Contig124.5 2e-07 36%
>gnl|Dere-agencourt-run1028-asm|contig_8152 4e-07 35%
>gnl|Dvir-agencourt-run1029-asm|contig_2170 0.044 25%
>gi|27645578|emb|BX072297.1|CNS09RY5 Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females. 2e-14 37%
>gi|21430733|gb|AY119185.1| Drosophila melanogaster SD09114 full insert cDNA 4e-10 33%
Gene prediction
SECIS elements
>gi|27645578|emb|BX072297.1|CNS09RY5 Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females. 5-PRIME end of clone FK0AAD1AF11 of strain 6-9 of Anopheles gambiae (African malaria mosquito): 640 734
CUUUAAUUUC UUCGAAU CGCCAUGCCGC AUGAC GAAGCCUGAGC AA ACCCCACGUGGGA CUCGAGCCUU UGAA GCUCU GCUGGCG UUUCGGUAGA
COVE score: 24.71 (Recommended threshold for COVE: 15)
>gi|21430733|gb|AY119185.1| Drosophila melanogaster SD09114 full insert cDNA: 816 917
CAAAACAAAG CGUUCAG CAAUAGAGCCCU AUGAU CGAUGAUUGGC AA AUCCUCUCGAGGA ACCGAUCGUUG AGAA CCCCUUUGCC UUUGUUG AUCGCUCAAU
COVE score: 21.48 (Recommended threshold for COVE: 15)
Multiple alignment
SPS2
The human SPS2 selenoprotein matched with high similarity in many of the insect genomes submitted to tblastn. Here we report two new selenoprotein candidates, in D. yakuba and D. simulans. All the other insect species were found to have an arginine (Arg) at the position of the human Sec.
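The check described above, reading off which residue a hit carries at the position of the human Sec, can be sketched as follows. This is only an illustration under stated assumptions: the two aligned strings are hypothetical toy data, the function names are invented for this example, and Sec is assumed to appear in the query as `*` (as in the tblastn output in this report) or as `U`.

```python
from __future__ import annotations


def residues_at_sec(query_aln: str, sbjct_aln: str,
                    sec_symbols: str = "*U") -> list[tuple[int, str]]:
    """For every Sec column in the aligned query, return
    (alignment column, residue found in the aligned subject)."""
    if len(query_aln) != len(sbjct_aln):
        raise ValueError("aligned sequences must have the same length")
    return [(col, s)
            for col, (q, s) in enumerate(zip(query_aln, sbjct_aln))
            if q in sec_symbols]


def classify(residue: str) -> str:
    """Give a rough label to the subject residue aligned with Sec."""
    if residue in ("*", "U"):
        return "Sec candidate (in-frame stop)"
    if residue == "C":
        return "Cys homologue"
    if residue == "-":
        return "gap"
    return f"other residue ({residue})"


if __name__ == "__main__":
    # Hypothetical toy alignment: the query carries Sec ('*') at column 3.
    query = "GKC*GHA"
    sbjct = "GKCRGHA"
    for col, res in residues_at_sec(query, sbjct):
        print(col, res, classify(res))
```

A hit labelled as a Sec candidate would still need a predicted SECIS element before being reported; an Arg or Cys at that column points to a non-selenoprotein homologue.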
tblastn
>gnl|Dvir-agencourt-run1029-asm|contig_641 e-112 60%
>gnl|Dmov-agencourt-run0811-asm|contig_3459 e-112 60%
>gnl|Dana-agencourt-040714-asm|contig_1299 e-111 60%
>gnl|Dere-agencourt-run1028-asm|contig_125 e-111 59%
>gnl|Dyak-washu-assembly_040407|Contig11.8 e-111 59%
>3 type=chromosome; loc=3:1..19738957; ID=3; release=r1.03; species=dpse e-110 59%
>gnl|Dana-agencourt-040714-asm|contig_30531 2e-97 58%
>XR_group8 type=chromosome; loc=XR_group8:1..9190824; ID=XR_group8; release=r1.03; species=dpse 2e-73 43%
>4_group4 type=chromosome; loc=4_group4:1..6604331; ID=4_group4;release=r1.03; species=dpse 2e-68 42%
>XL_group1e type=chromosome; loc=XL_group1e:1..12499574; ID=XL_group1e; release=r1.03; species=dpse 2e-54 38%
>gnl|Dyak-washu-assembly_040407|Contig8.19 7e-52 38%
>4_group3 type=chromosome; loc=4_group3:1..11635473; ID=4_group3;release=r1.03; species=dpse 2e-52 36%
>gnl|Dana-agencourt-040714-asm|contig_1089 3e-51 39%
>gnl|Dsim-washu-w501-asm|Contig27.138 3e-51 39%
>gnl|Dvir-agencourt-run1029-asm|contig_2233 6e-31 44%
>gnl|Dmov-agencourt-run0811-asm|contig_181 0.060 50%
>gnl|Dsim-washu-w501-asm|Contig18.273 3e-06 41%
>gi|24653603|ref|NM_166046.1| Drosophila melanogaster CG8553-PB, isoform B (SelD) mRNA, complete cds e-130 60%
>gi|58390119|ref|XM_317503.2| Anopheles gambiae str.
PEST ENSANGP00000010100 e-126 59% Gene prediction
SECIS elements Multiple alignment SelT tblastn >gnl|Dana-agencourt-040714-asm|contig_1097 3e-30 43% >gnl|Dvir-agencourt-run1029-asm|contig_992 5e-30 42% >4_group4 type=chromosome; loc=4_group4:1..6604331;ID=4_group4;release=r1.03; species=dpse 6e-30 40% >gi|24581818|ref|NM_135053.2| Drosophila melanogaster CG3887-PA (CG3887) mRNA, complete cds 2e-37 43% >gi|42765836|gb|AY440807.1| Armigeres subalbatus ASAP ID: 42399 putative: selenoprotein T mRNA. 1e-36 44% >gi|42761900|gb|AY433028.1| Aedes aegypti ASAP ID: 36405 putative: selenoprotein T mRNA. 
7e-34 48% >gi|58393304|ref|XM_319979.2| Anopheles gambiae str.PEST ENSANGP00000016703 8e-33 45% Gene prediction >gi|42765836|gb|AY440807.1| Armigeres subalbatus ASAP ID:42399 putative:selenoprotein T mRNA Score = 151 bits (381), Expect = 1e-36 Identities = 69/154 (44%), Positives = 101/154 (65%) Frame = +2 Query: 25 GPLLKFQICVS*GYRRVFEEYMRVISQRYPDIRIEGENYLPQPIYRHIASFLSVFKLVLI 84 G + F C S GYR+ F+EY +I ++YP+I I G NY P +++ L V KL++I Sbjct: 218 GATMTFMYCYSCGYRKAFDEYYNIIHEKYPEITIRGGNYDPPGFNMYLSKILLVTKLLMI 397 Query: 85 GLIIVGKDPFAFFGMQAPSIWQWGQENKVYACMMVFFLSNMIENQCMSTGAFEITLNDVP 144 ++ D F F PS W+W +NK+YACM+VFFL NM+E Q +S+GAFEI LNDVP Sbjct: 398 IALVSNFDLFGFLRQPMPSWWRWCTDNKIYACMLVFFLGNMLEAQLISSGAFEIALNDVP 577 Query: 145 VWSKLESGHLPSMQQLVQILDNEMKLNVHMDSIP 178 VW KLE+G +P+ Q+L QI+D+ ++ + ++ P Sbjct: 578 VWPKLETGRIPAPQELFQIIDSHLQFSDKIEQNP 679 >gi|42761900|gb|AY433028.1| Aedes aegypti ASAP ID: 36405 putative: selenoprotein T mRNA Score = 140 bits (353), Expect = 2e-33 Identities = 63/132 (47%), Positives = 90/132 (68%) Frame = +3 Query: 25 GPLLKFQICVS*GYRRVFEEYMRVISQRYPDIRIEGENYLPQPIYRHIASFLSVFKLVLI 84 G + F C S GYR+ ++EY +I ++YP+I I G NY P +++ V KL +I Sbjct: 324 GATMTFMYCYSCGYRKAYDEYYNIIHEKYPEITIRGANYDPPGFNMYLSKIXLVAKLAMI 503 Query: 85 GLIIVGKDPFAFFGMQAPSIWQWGQENKVYACMMVFFLSNMIENQCMSTGAFEITLNDVP 144 +++ + F F ++ PS WQW +NK+YACMMVFFL NM+E Q +S+GAFEI+LNDVP Sbjct: 504 MVLMSNFNLFGFLNLRIPSWWQWCTDNKMYACMMVFFLGNMLEAQLISSGAFEISLNDVP 683 Query: 145 VWSKLESGHLPS 156 VWSKLE+G +P+ Sbjct: 684 VWSKLETGRIPA 719 >gi|58393304|ref|XM_319979.2| Anopheles gambiae str. 
PEST ENSANGP00000016703, partial mRNA Score = 138 bits (348), Expect = 8e-33 Identities = 66/145 (45%), Positives = 93/145 (64%) Frame = +1 Query: 25 GPLLKFQICVS*GYRRVFEEYMRVISQRYPDIRIEGENYLPQPIYRHIASFLSVFKLVLI 84 G + F C S GYR+ F++Y +I ++YP+I I G NY P + ++ L V KL+LI Sbjct: 130 GATMTFLYCYSCGYRKAFDDYHNLILEKYPEITIRGSNYDPSGVNMLLSKVLLVTKLLLI 309 Query: 85 GLIIVGKDPFAFFGMQAPSIWQWGQENKVYACMMVFFLSNMIENQCMSTGAFEITLNDVP 144 ++ D + G WQW NK+YA MM+FFL N +E Q +S+GAFEITLNDVP Sbjct: 310 AALMSNYDIGRYIGNPFAGWWQWCFNNKLYASMMIFFLGNTLEAQLISSGAFEITLNDVP 489 Query: 145 VWSKLESGHLPSMQQLVQILDNEMK 169 VWSKLE+G P+ Q++ QI+DN ++ Sbjct: 490 VWSKLETGRFPAPQEMFQIIDNHLQ 564 SECIS elements Multiple alignment SelR tblastn >gnl|Dmov-agencourt-run0811-asm|contig_717 0.003 40% >gnl|Dana-agencourt-040714-asm|contig_4797 0.004 40% >gnl|Dyak-washu-assembly_040407|Contig7.13 0.004 40% >gnl|Dvir-agencourt-run1029-asm|contig_386 0.004 40% >2 type=chromosome; loc=2:1..30711475; ID=2; release=r1.03; species=dpse 0.004 40% >gnl|Dere-agencourt-run1028-asm|contig_1017 0.004 40% >gnl|Dsim-washu-w501-asm|Contig20.125 0.004 40% >gi|27605048|emb|BX031767.1|CNS08WOB Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females.3-PRIME 3e-14 38% Gene prediction >gnl|Dmov-agencourt-run0811-asm|contig_717 Length = 57670 Score = 41.8 bits (131), Expect = 0.003 Identities = 18/44 (40%), Positives = 24/44 (54%) Frame = -1 Query: 12 FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSV 55 + H+E GVY C C +LFSS +KY WPAF + + V Sbjct: 21133 YNKHYEKGVYRCIVCHQDLFSSDTKYDSGCGWPAFNDVLDKGKV 21002 Score = 36.5 bits (112), Expect = 0.13 Identities = 21/46 (45%), Positives = 29/46 (63%) Frame = -3 Query: 59 PEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFV 104 PE R+E V C +C +GH F DGPKP + R+ I S+S++FV Sbjct: 18638 PERIRTE---VRCARCNAHMGHVF-EDGPKPTRKRYCINSASIEFV 18513 Score = 35.1 bits (107), Expect = 0.36 Identities = 18/37 (48%), Positives = 24/37 (64%) Frame = -2 Query: 68 
KVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFV 104 +V C KC +GH F +DGP P + RF I S+S+ FV Sbjct: 20268 EVRCSKCSAHMGHVF-DDGPPPKHRRFCINSASIDFV 20161 >gnl|Dana-agencourt-040714-asm|contig_4797 Length = 61609 Score = 41.5 bits (130), Expect = 0.004 Identities = 18/44 (40%), Positives = 24/44 (54%) Frame = -3 Query: 12 FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSV 55 + H+E GVY C C +LFSS +KY WPAF + + V Sbjct: 43103 YNKHYEKGVYQCIVCHQDLFSSDTKYDSGCGWPAFNDVLDKGKV 42972 Score = 40.4 bits (126), Expect = 0.009 Identities = 25/55 (45%), Positives = 34/55 (61%) Frame = -3 Query: 59 PEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSAS 113 PE R+E V C +C +GH F DGPKP + R+ I S+S++FV K+ SAS Sbjct: 40394 PERIRTE---VRCARCSAHMGHVF-EDGPKPTRKRYCINSASIEFVTGEKDPSAS 40242 Score = 35.6 bits (109), Expect = 0.24 Identities = 27/70 (38%), Positives = 37/70 (52%), Gaps = 1/70 (1%) Frame = -3 Query: 36 KYAHSSPWPAFTETIHADSV-AKRPEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF 94 K+ HSS TI + V +K R+E V C +C +GH F +DGP P + RF Sbjct: 42236 KFRHSSHTSITVNTIKSTVVISKTLGMVRTE---VRCSRCSAHMGHVF-DDGPPPKHRRF 42069 Query: 95 *IFSSSLKFV 104 I S+S+ FV Sbjct: 42068 CINSASIDFV 42039 >gnl|Dyak-washu-assembly_040407|Contig7.13 Length = 111046 Score = 41.5 bits (130), Expect = 0.004 Identities = 18/44 (40%), Positives = 24/44 (54%) Frame = +1 Query: 12 FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSV 55 + H+E GVY C C +LFSS +KY WPAF + + V Sbjct: 46012 YNKHYEKGVYQCIVCHQDLFSSDTKYDSGCGWPAFNDVLDKGKV 46143 Score = 37.6 bits (116), Expect = 0.062 Identities = 23/54 (42%), Positives = 32/54 (59%) Frame = +1 Query: 59 PEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSA 112 PE R+E V C +C +GH F DGPKP + R+ I S+S++FV TS+ Sbjct: 48979 PERIRTE---VRCARCNAHMGHVF-EDGPKPTRKRYCINSASIEFVNADPATSS 49128 Score = 35.3 bits (108), Expect = 0.29 Identities = 18/45 (40%), Positives = 26/45 (57%) Frame = +2 Query: 68 KVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSA 112 +V C +C +GH F +DGP P + RF I S+S+ FV + A Sbjct: 47174 EVRCSRCSAHMGHVF-DDGPPPKHRRFCINSASIDFVKSATPSKA 47305 
>gnl|Dvir-agencourt-run1029-asm|contig_386 Length = 191045 Score = 41.5 bits (130), Expect = 0.004 Identities = 18/44 (40%), Positives = 24/44 (54%) Frame = -1 Query: 12 FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSV 55 + H+E GVY C C +LFSS +KY WPAF + + V Sbjct: 176732 YNKHYEKGVYQCIVCHQDLFSSDTKYDSGCGWPAFNDVLDKGKV 176601 Score = 36.5 bits (112), Expect = 0.13 Identities = 21/46 (45%), Positives = 29/46 (63%) Frame = -3 Query: 59 PEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFV 104 PE R+E V C +C +GH F DGPKP + R+ I S+S++FV Sbjct: 174108 PERIRTE---VRCARCNAHMGHVF-EDGPKPTRKRYCINSASIEFV 173983 Score = 35.6 bits (109), Expect = 0.24 Identities = 18/39 (46%), Positives = 25/39 (64%) Frame = -3 Query: 68 KVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPK 106 +V C KC +GH F +DGP P + RF I S+S+ FV + Sbjct: 175752 EVRCSKCSAHMGHVF-DDGPPPKHRRFCINSASIDFVKR 175639 >2 type=chromosome; loc=2:1..30711475; ID=2; release=r1.03; species=dpse Length = 30711475 Score = 41.5 bits (130), Expect = 0.004 BLAST HIT on Genome Map Identities = 18/44 (40%), Positives = 24/44 (54%) Frame = -3 Query: 12 FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSV 55 + H+E GVY C C +LFSS +KY WPAF + + V Sbjct: 28012277 YNKHYEKGVYQCIVCHQDLFSSDTKYDSGCGWPAFNDVLDKGKV 28012146 Score = 36.5 bits (112), Expect = 0.13 BLAST HIT on Genome Map Identities = 21/46 (45%), Positives = 29/46 (63%) Frame = -1 Query: 59 PEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFV 104 PE R+E V C +C +GH F DGPKP + R+ I S+S++FV Sbjct: 28009705 PERIRTE---VRCARCNAHMGHVF-EDGPKPTRKRYCINSASIEFV 28009580 Score = 35.9 bits (110), Expect = 0.20 BLAST HIT on Genome Map Identities = 22/46 (47%), Positives = 29/46 (63%) Frame = -2 Query: 68 KVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSAS 113 +V C KC +GH F +DGP P + RF I S+S+ FV K T+AS Sbjct: 28011345 EVRCSKCSAHMGHVF-DDGPPPKHHRFCINSASIDFV-KSAPTAAS 28011214 >gnl|Dere-agencourt-run1028-asm|contig_1017 Length = 31091 Score = 41.5 bits (130), Expect = 0.004 Identities = 18/44 (40%), Positives = 24/44 (54%) Frame = +3 Query: 12 
FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSV 55 + H+E GVY C C +LFSS +KY WPAF + + V Sbjct: 19356 YNKHYEKGVYQCIVCHQDLFSSDTKYDSGCGWPAFNDVLDKGKV 19487 Score = 37.6 bits (116), Expect = 0.062 Identities = 23/54 (42%), Positives = 32/54 (59%) Frame = +2 Query: 59 PEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSA 112 PE R+E V C +C +GH F DGPKP + R+ I S+S++FV TS+ Sbjct: 22229 PERIRTE---VRCARCNAHMGHVF-EDGPKPTRKRYCINSASIEFVNADPATSS 22378 Score = 35.3 bits (108), Expect = 0.29 Identities = 18/45 (40%), Positives = 26/45 (57%) Frame = +1 Query: 68 KVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSA 112 +V C +C +GH F +DGP P + RF I S+S+ FV + A Sbjct: 20458 EVRCSRCSAHMGHVF-DDGPPPKHRRFCINSASIDFVKSATPSKA 20589 >gnl|Dsim-washu-w501-asm|Contig20.125 Length = 16361 Score = 41.5 bits (130), Expect = 0.004 Identities = 18/44 (40%), Positives = 24/44 (54%) Frame = +3 Query: 12 FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSV 55 + H+E GVY C C +LFSS +KY WPAF + + V Sbjct: 4344 YNKHYEKGVYQCIVCHQDLFSSDTKYDSGCGWPAFNDVLDKGKV 4475 Score = 37.6 bits (116), Expect = 0.062 Identities = 23/54 (42%), Positives = 32/54 (59%) Frame = +1 Query: 59 PEHNRSEALKVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSA 112 PE R+E V C +C +GH F DGPKP + R+ I S+S++FV TS+ Sbjct: 7255 PERIRTE---VRCARCNAHMGHVF-EDGPKPTRKRYCINSASIEFVNADPATSS 7404 Score = 35.3 bits (108), Expect = 0.29 Identities = 18/45 (40%), Positives = 26/45 (57%) Frame = +2 Query: 68 KVSCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKETSA 112 +V C +C +GH F +DGP P + RF I S+S+ FV + A Sbjct: 5453 EVRCSRCSAHMGHVF-DDGPPPKHRRFCINSASIDFVKSATPSKA 5584 >gi|27605048|emb|BX031767.1|CNS08WOB Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females. 
3-PRIME end of clone FK0AAA47BD04 of strain 6-9 of Anopheles gambiae (African malaria mosquito) Score = 75.5 bits (184), Expect = 3e-14 Identities = 38/100 (38%), Positives = 52/100 (52%), Gaps = 2/100 (2%) Frame = -2 Query: 12 FQNHFEPGVYVCAKCGYELFSSRSKYAHSSPWPAFTETIHADSVA--KRPEHNRSEALKV 69 + +E G Y+C C ELFSS +KY WPAF + + V K P +V Sbjct: 865 YNKFYEKGTYICVVCSQELFSSETKYDSGCGWPAFNDVLDQGKVTLHKDPSIPGRVRTEV 686 Query: 70 SCGKCGNGLGHEFLNDGPKPGQSRF*IFSSSLKFVPKGKE 109 C KC +GH F DGP P + R+ I S+S++F+P G E Sbjct: 685 RCSKCAAHMGHVF-EDGPPPTRKRYCINSASIEFMPAGSE 569
SECIS elements
Multiple alignment
GPx4
After the tblastn similarity search, all five human GPx selenoproteins were found to match the same DNA sequence in each species. This was probably due to the high similarity among the human GPx selenoproteins. We therefore decided to determine, for each species, which human selenoprotein the shared DNA sequence was most closely related to. We could have chosen the best candidate from the E-values and identity percentages in the tblastn results, but to make the choice more reliable we performed a multiple sequence alignment followed by an unrooted neighbour-joining (N-J) tree that included all human GPx proteins and the protein found in each species.
tblastn
As the trees show, all the proteins we studied are closer to GPx4 than to any other human GPx selenoprotein, although this does not mean that all the proteins found in insect genomes share the function of human GPx4. We therefore use GPx4 for gene prediction and multiple alignment, and only the GPx4 tblastn results are shown.
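The question answered by the tree (which human GPx is each insect protein closest to?) can also be approximated directly from an existing multiple alignment. The sketch below is a deliberately simpler stand-in for the neighbour-joining tree, not the method used in this report: it computes the p-distance (fraction of differing residues over columns without gaps) between a candidate and each reference and picks the smallest. The sequences and function names are hypothetical.

```python
from __future__ import annotations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues between two aligned sequences,
    counted only over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_reference(candidate: str,
                      references: dict[str, str]) -> tuple[str, dict[str, float]]:
    """Return the name of the reference with the smallest p-distance
    to the candidate, together with all computed distances."""
    distances = {name: p_distance(candidate, seq)
                 for name, seq in references.items()}
    best = min(distances, key=distances.get)
    return best, distances


if __name__ == "__main__":
    # Hypothetical, already-aligned toy sequences (not real GPx proteins).
    human = {"GPx1": "ACDEFGHIKL", "GPx4": "ACDQFGHLKL"}
    insect = "ACDQFGH-KL"
    best, dists = closest_reference(insect, human)
    print(best, dists)
```

Unlike a tree, this comparison ignores the relationships among the references themselves, so it is best read as a quick sanity check alongside the alignment.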
>gnl|Dyak-washu-assembly_040407|Contig5.61 2e-25 46% >gnl|Dere-agencourt-run1028-asm|contig_2920 1e-24 39% >XR_group8 type=chromosome; loc=XR_group8:1..9190824 ID=XR_group release=r1.03; species=dpse 6e-23 38% >3 type=chromosome; loc=3:1..19738957;ID=3; release=r1.03; species=dps 1e-20 39% >gnl|Dsim-washu-w501-asm|Contig12.100 5e-14 39% >gnl|Dana-agencourt-040714-asm|contig_263 7e-14 49% >gnl|Dana-agencourt-040714-asm|contig_4443 3e-13 48% >gnl|Dana-agencourt-040714-asm|contig_1256 3e-13 39% >gnl|Dvir-agencourt-run1029-asm|contig_1983 1e-13 48% >gnl|Dere-agencourt-run1028-asm|contig_2847 1e-13 37% >gnl|Dsim-washu-w501-asm|Contig0.495 1e-13 46% >gnl|Dyak-washu-assembly_040407|Contig15.28 2e-12 38% >gnl|Dmov-agencourt-run0811-asm|contig_5053 3e-06 40% >gi|48117434|ref|XM_396418.1| Apis mellifera similar to putative thioredoxin perxidase 1e-47 55% >gi|33306812|gb|AF394234.1| Aedes aegypti glutathione peroxidase (GPx) mRNA, 2e-44 56% >gi|58384275|ref|XM_313166.2| Anopheles gambiae str. PEST ENSANGP00000024750 1e-34 46% >gi|50897528|gb|AY625510.1| Glossina morsitans morsitans clone Gmm0092 putative glutathione 6e-33 45% >gi|42765409|gb|AY440380.1| Armigeres subalbatus ASAP ID: 42545 glutathione peroxidase mRNA 5e-29 48% >gi|28564458|emb|AJ547804.1|IRI547804 Ixodes ricinus mRNA for glutathione peroxidase (gluper1) 1e-17 48% >gi|12958610|gb|AF321612.1| Venturia canescens virus-like particle protein mRNA 7e-18 33% >gi|40882422|gb|BT011331.1| Drosophila melanogaster SD18370 full insert cDNA 2e-34 44% Gene prediction >gnl|Dyak-washu-assembly_040407|Contig5.61(contig is not correct, exonerate can not be performed) Length = 146236 Score = 76.1 bits (253), Expect(2) = 2e-25 Identities = 40/86 (46%), Positives = 60/86 (69%) Frame = +1 Query: 30 ASRDDWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGKTEVNYTQLVDLHAR 89 ++ D++ A S++EF+ KD G+ ++L+KY+G V +V N+AS+ G T+ NY +L DL + Sbjct: 63946 SANGDYKNAASIYEFTVKDTHGNDISLEKYKGKVVLVVNIASKCGLTKNNYQKLTDLKEK 64125 Query: 90 YAECGLRILAFPCNQFGKQEPGSNEE 
115 Y E GL IL FPCNQFG Q P ++ E Sbjct: 64126 YGERGLVILNFPCNQFGSQMPEADGE 64203 Score = 60.1 bits (196), Expect(2) = 2e-25 Identities = 32/62 (51%), Positives = 44/62 (70%) Frame = +2 Query: 132 KICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEK 191 ++ VNGDDA PL+K++K K G LG+ IKWNFTKFL++K G + RY P +P+ I K Sbjct: 64322 QVDVNGDDAAPLYKYLKA--KQTGTLGSGIKWNFTKFLVNKEGIPINRYAPTTDPMDIAK 64495 Query: 192 DL 193 D+ Sbjct: 64496 DI 64501 >gnl|Dere-agencourt-run1028-asm|contig_2920 Length = 10481 Score = 113 bits (385), Expect = 1e-24 Identities = 74/188 (39%), Positives = 110/188 (58%), Gaps = 24/188 (12%) Frame = +3 Query: 30 ASRDDWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGKTEVNYTQLVDLHAR 89 ++ D++ A S++EF+ KD G+ ++L+KY+G V +V N+AS+ G T+ NY +L DL + Sbjct: 7311 SANGDYKNAASIYEFTVKDTHGNDISLEKYKGKVVLVVNIASKCGLTKNNYQKLTDLKEK 7490 Query: 90 YAECGLRILAFPCNQFGKQEPGSNEE----------------------IKEFAAGYNVKF 127 Y E GL IL FPCNQFG Q P ++ E +K +F Sbjct: 7491 YGERGLVILNFPCNQFGSQMPEADGEAMVCHLRDSKADIGEVFAKVRLLKLVVGCSRPRF 7670 Query: 128 D--MFSKICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEE 185 + F ++ VNGD+A PL+K++K K G LG+ IKWNFTKFL++K G + RY P + Sbjct: 7671 NNLRFLQVDVNGDNAAPLYKYLK--AKQTGTLGSGIKWNFTKFLVNKEGVPINRYAPTTD 7844 Query: 186 PLVIEKDL 193 P+ I KD+ Sbjct: 7845 PMDISKDI 7868 >gnl|Dsim-washu-w501-asm|Contig12.100 Length = 22061 Score = 71.1 bits (235), Expect(2) = 5e-14 Identities = 48/122 (39%), Positives = 64/122 (52%), Gaps = 2/122 (1%) Frame = -2 Query: 74 GKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSN--EEIKEFAAGYNVKFDMFS 131 G T Y L L Y E GLRIL FPCNQFG Q P S+ E + + +F+ Sbjct: 8890 GLTLSQYNGLRYLLEEYEEQGLRILNFPCNQFGGQMPESDGQEMLDHLRREGANIGHLFA 8711 Query: 132 KICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEK 191 KI V G A PL+K + + + I+WNF KFL+D+ G + KRYG EP+ + Sbjct: 8710 KIDVKGAQADPLYKLLTRHQ-------HDIEWNFVKFLVDRKGNIHKRYGAELEPVALTD 8552 Query: 192 DL 193 D+ Sbjct: 8551 DI 8546 Score = 26.9 bits (78), Expect(2) = 5e-14 Identities = 13/40 (32%), Positives = 24/40 (60%) Frame 
= -3 Query: 34 DWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQX 73 D R ++H ++ +D G+ V LD + G V ++ N+AS+ Sbjct: 9072 DMRWRLTIHALTVRDTFGNPVQLDIFAGHVMLIVNIASR* 8953 >gnl|Dere-agencourt-run1028-asm|contig_2847 Length = 53404 Score = 70.2 bits (232), Expect(2) = 1e-13 Identities = 46/122 (37%), Positives = 64/122 (52%), Gaps = 2/122 (1%) Frame = +1 Query: 74 GKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSN--EEIKEFAAGYNVKFDMFS 131 G T Y L L Y + GLRIL FPCNQFG Q P S+ E + + +F+ Sbjct: 4213 GLTSSQYNGLRYLLEEYEDRGLRILNFPCNQFGGQMPESDGQEMLDHLRREGANIGHIFA 4392 Query: 132 KICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEK 191 K+ V G A PL+K + + + I+WNF KFL+D+ G + KRYG EP+ + Sbjct: 4393 KVDVKGAQADPLYKLLTRHQQD-------IEWNFVKFLVDRKGNIHKRYGAELEPVALTD 4551 Query: 192 DL 193 D+ Sbjct: 4552 DI 4557 Score = 26.3 bits (76), Expect(2) = 1e-13 Identities = 13/42 (30%), Positives = 25/42 (59%) Frame = +3 Query: 34 DWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGK 75 D R +++ ++ +D G V LDK+ G V ++ N+AS+ + Sbjct: 4038 DMRWRLTIQALTVRDTFGKPVQLDKFAGHVMLIVNIASK*ER 4163 >gnl|Dana-agencourt-040714-asm|contig_4443 Length = 7909 Score = 46.9 bits (149), Expect(2) = 3e-13 Identities = 27/56 (48%), Positives = 35/56 (62%) Frame = +1 Query: 60 RGFVCIVTNVASQXGKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSNEE 115 +G V +V N+ASQ G T+ NY L DL +Y + G IL FPCNQFG Q ++ E Sbjct: 1363 KGQVFLVVNIASQCGLTKNNYQTLTDLKEKYGDIG*IILNFPCNQFGSQMLETDGE 1530 >gnl|Dmov-agencourt-run0811-asm|contig_5053 Length = 2228 Score = 51.7 bits (166), Expect = 3e-06 Identities = 29/71 (40%), Positives = 46/71 (64%) Frame = -2 Query: 31 SRDDWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGKTEVNYTQLVDLHARY 90 S D++ A S++EF+ KD G+ V+L+KY+G V ++ N+AS+ G T+ NY +L DL +Y Sbjct: 1636 SDGDYKNAASIYEFNVKDTHGNDVSLEKYKGQVILIVNIASKCGLTKNNYKKLTDLKEKY 1457 Query: 91 AECGLRILAFP 101 E G +P Sbjct: 1456 GERGTDHPELP 1424 >gnl|Dana-agencourt-040714-asm|contig_1256 Length = 119871 Score = 74.2 bits (246), Expect = 3e-13 Identities = 50/126 (39%), Positives = 67/126 (53%), Gaps = 2/126 (1%) 
Frame = -1 Query: 74 GKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSN-EEIKEFAAGYNVKF-DMFS 131 G T Y L +L +Y E GL IL FPCNQFG Q P S+ EI E +F+ Sbjct: 23997 GLTSSQYAGLHELREKYEERGLSILNFPCNQFGAQMPESDGREILEHLRQKKANIGHIFA 23818 Query: 132 KICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEK 191 KI VNG +A PL+K + + I+WNF KFLID+ G + RYG ++P V+ Sbjct: 23817 KIKVNGRNADPLYKLLTRK-------APRIEWNFVKFLIDRKGNIYGRYGAEKKPAVLVN 23659 Query: 192 DLPHYF 197 D+ + Sbjct: 23658 DIERLL 23641 >gnl|Dyak-washu-assembly_040407|Contig15.28 Length = 56780 Score = 71.6 bits (237), Expect = 2e-12 Identities = 47/122 (38%), Positives = 64/122 (52%), Gaps = 2/122 (1%) Frame = -1 Query: 74 GKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSN--EEIKEFAAGYNVKFDMFS 131 G T Y L L Y + GLRIL FPCNQFG Q P S+ E + + +F+ Sbjct: 32579 GLTSTQYNGLRYLLEEYEDRGLRILNFPCNQFGAQMPESDGQEMLDHLRREGANIGQLFA 32400 Query: 132 KICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEK 191 KI V G A PL+K + + + I+WNF KFL+D+ G + KRYG EP+ + Sbjct: 32399 KIDVKGAQADPLYKLLTRHQ-------HDIEWNFVKFLVDRRGNIYKRYGAELEPVALTD 32241 Query: 192 DL 193 D+ Sbjct: 32240 DI 32235 >gi|48117434|ref|XM_396418.1| Apis mellifera similar to putative thioredoxin perxidase (LOC412967), mRNA Score = 187 bits (476), Expect = 1e-47 Identities = 91/165 (55%), Positives = 120/165 (72%), Gaps = 1/165 (0%) Frame = +1 Query: 34 DWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGKTEVNYTQLVDLHARYAEC 93 +W+ A ++++F AKDI G+ V+L+KYRG VCI+ NVAS G T+ NY +LV L+ +Y E Sbjct: 163 NWKSASTIYDFHAKDIHGNDVSLNKYRGHVCIIVNVASNCGLTDTNYRELVQLYEKYNEK 342 Query: 94 -GLRILAFPCNQFGKQEPGSNEEIKEFAAGYNVKFDMFSKICVNGDDAHPLWKWMKIQPK 152 GLRILAFP N+FG QEPG++ EI EF YNV FD+F KI VNGD+AHPLWKW+K Q Sbjct: 343 EGLRILAFPSNEFGGQEPGTSVEILEFVKKYNVTFDLFEKINVNGDNAHPLWKWLKTQ-- 516 Query: 153 GKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEKDLPHYF 197 G + + IKWNF+KF+I+K G VV R+ P +PL +E +L YF Sbjct: 517 ANGFITDDIKWNFSKFIINKEGKVVSRFAPTVDPLQMESELKKYF 651 >gi|33306812|gb|AF394234.1| Aedes aegypti glutathione peroxidase (GPx) mRNA, complete cds Score 
= 177 bits (448), Expect = 2e-44 Identities = 91/160 (56%), Positives = 119/160 (74%), Gaps = 2/160 (1%) Frame = +1 Query: 40 SMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGKTEVNYTQLVDLHARYAEC-GLRIL 98 S+++FSA DIDG+ V+ ++YRG V I+ NVAS+ G T +Y +L +L+ Y E GLRIL Sbjct: 274 SVYDFSAVDIDGNKVDFERYRGHVLIIVNVASKCGYTAGHYKELNELYEEYGETEGLRIL 453 Query: 99 AFPCNQFGKQEPGSNEEIKEFA-AGYNVKFDMFSKICVNGDDAHPLWKWMKIQPKGKGIL 157 AFPCNQFG QEPG+NEEIK FA KFD+F+KI VNGD+AHPLW+++K Q +G G L Sbjct: 454 AFPCNQFGNQEPGTNEEIKHFARVEKGAKFDLFAKIYVNGDEAHPLWQFLK-QRQG-GTL 627 Query: 158 GNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEKDLPHYF 197 +AIKWNFTKF++DKNG V+R+GP PL + +L YF Sbjct: 628 FDAIKWNFTKFIVDKNGQPVERHGPQTSPLQLRDNLKKYF 747 >gi|58384275|ref|XM_313166.2| Anopheles gambiae str. PEST ENSANGP00000024750 (ENSANGG00000011473), mRNA Score = 144 bits (364), Expect = 1e-34 Identities = 76/163 (46%), Positives = 109/163 (66%), Gaps = 2/163 (1%) Frame = +1 Query: 33 DDWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGKTEVNYTQLVDLHARYAE 92 +D++ A+S+++F+ KD G V+L+KYRG V ++ N+ASQ G T+ NY +L +L +YA+ Sbjct: 118 EDYKNAKSVYDFTVKDSQGADVSLEKYRGKVLLIVNIASQCGLTKGNYAELTELSQKYAD 297 Query: 93 CGLRILAFPCNQFGKQEP-GSNEEIKEFAAGYNVKF-DMFSKICVNGDDAHPLWKWMKIQ 150 +IL+FPCNQFG Q P G EE+ + D+F+KI VNGD AHPL+K++K Sbjct: 298 KDFKILSFPCNQFGGQMPEGDGEEMVCHLRSAKAEVGDVFAKIDVNGDGAHPLYKYLK-- 471 Query: 151 PKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEKDL 193 K G LG++IKWNF KFL++K+G V RY P P I KD+ Sbjct: 472 HKQGGTLGDSIKWNFAKFLVNKDGQPVDRYAPTTSPSSIVKDI 600 >gi|50897528|gb|AY625510.1 Glossina morsitans morsitans clone Gmm0092 putative glutathione peroxidase mRNA Score = 139 bits (350), Expect = 6e-33 Identities = 74/161 (45%), Positives = 107/161 (66%), Gaps = 5/161 (3%) Frame = +3 Query: 38 ARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQXGKTEVNYTQLVDLHARYAECGLRI 97 A S+++F+ KD G+ V+L++YRG V ++ N+ASQ G T+ NY +L DL +Y + GL+I Sbjct: 312 ASSIYDFTVKDTYGNDVSLEQYRGHVVLIVNIASQCGLTKNNYKKLTDLREKYGDKGLKI 491 Query: 98 LAFPCNQFGKQEPGSNEE-----IKEFAAGYNVKFDMFSKICVNGDDAHPLWKWMKIQPK 152 L FPCNQFG Q P S+ E 
+++ A D+F K+ VNG +A PL++++K K Sbjct: 492 LNFPCNQFGSQMPESDGEPMVCHLRDAKADIG---DVFQKVDVNGANAAPLYQYLK--AK 656 Query: 153 GKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEKDL 193 G L +AIKWNFTKFL++K G VKRY P +P+ I KD+ Sbjct: 657 QGGTLVSAIKWNFTKFLVNKEGIPVKRYAPTTDPMDIAKDI 779
>gi|28564458|emb|AJ547804.1|IRI547804 Ixodes ricinus mRNA for glutathione peroxidase (gluper1 gene) Length = 914 Score = 88.6 bits (218), Expect = 1e-17
>gi|12958610|gb|AF321612.1|AF321612 Venturia canescens virus-like particle protein mRNA, complete cds Length = 861
SECIS elements
Multiple alignment
As the tblastn results for the sequences above show, the E-values are good, but there is no evidence of a match to the human Sec position (encoded by a stop codon). We did notice, however, that the flanking amino acids matched, although in different frames. The exonerate analysis returned no results. To overcome these problems, we derived the protein sequences using FASTACHUNK and TRANSLATE TOOL. Based on the common amino acid sequence found in the rest of the genomic fragments, we concluded that our protein sequence was split across two frames, so we assembled our candidate proteins from fragments translated in different frames and submitted them to multiple alignment analysis. Arguments explaining this observation are given in the Discussion section. If you want to see the ClustalW alignment, click here.
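The frame check described above can be illustrated with a minimal Python sketch. It is not the FASTACHUNK / TRANSLATE TOOL procedure we actually used: the function names (`six_frame_translations`, `frames_containing`) and the toy sequence are hypothetical, and the only assumption about the biology is the standard genetic code (NCBI translation table 1). It translates a fragment in all six frames and reports which frames contain a given peptide, which is how a protein split across two frames would show up.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate dna starting at offset; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame (+1, +2, +3, -1, -2, -3) to its translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames


def frames_containing(dna, peptide):
    """Return the sorted list of frames whose translation contains peptide."""
    return sorted(
        frame
        for frame, protein in six_frame_translations(dna).items()
        if peptide in protein
    )


# Toy fragment (invented for illustration): one extra base after the
# third codon moves the rest of the peptide from frame +1 to frame +2.
toy = "ATGGCCTGTGAAATGG"
print(frames_containing(toy, "MAC"))  # first part of the peptide
print(frames_containing(toy, "KW"))   # second part of the peptide
```

On the toy fragment, the two halves of the peptide are found in different forward frames, and the third frame contains an in-frame TGA. A candidate protein would then be assembled by joining the two translated pieces, as we did by hand.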
15 KDa
tblastn
Each hit is listed with its E-value and percentage identity.
>gnl|Dmov-agencourt-run0811-asm|contig_1424 2e-22 42%
>gnl|Dvir-agencourt-run1029-asm|contig_3670 1e-21 39%
>gnl|Dana-agencourt-040714-asm|contig_267 2e-21 51%
>gnl|Dere-agencourt-run1028-asm|contig_376 2e-21 51%
>gnl|Dyak-washu-assembly_040407|Contig17.16 6e-21 53%
>gnl|Dsim-washu-w501-asm|Contig3419.1 6e-21 53%
>XR_group6 type=chromosome; loc=XR_group6:1..6604477; ID=XR_group6;release=r1.03; species=dpse 6e-20 42%
>gi|24666044|ref|NM_140743.1| Drosophila melanogaster CG7484-PB(CG7484)mRNA 5e-27 45%
>gi|58386839|ref|XM_315090.2| Anopheles gambiae str.PEST ENSANGP00000011457 2e-32 44%
>gi|18389880|gb|AF457547.1| Anopheles gambiae selenoprotein mRNA,partial cds 1e-31 44%
>gi|42765457|gb|AY440428.1| Armigeres subalbatus ASAP ID:40327 selenoprotein mRNA 3e-31 43%
>gi|42763702|gb|AY431560.1| Aedes aegypti ASAP ID:35705 selenoprotein mRNA 8e-30 45%
>gi|48094317|ref|XM_394140.1| Apis mellifera similar to CG7484-PB(LOC410663),mRNA 3e-20 38%
Gene prediction
>gi|18389880|gb|AF457547.1| Anopheles gambiae selenoprotein mRNA, partial cds Score = 134 bits (337), Expect = 1e-31 Identities = 67/152 (44%), Positives = 95/152 (62%), Gaps = 4/152 (2%) Frame = +1 Query: 15 LRLLLATVL--QAVSAFGAEFSSEACRELGFXXXXXXXXX-XXXGQFNLLQLDPDCRGCC 71 +RL T L V+ GAEFS+E CRELG + L++L C CC Sbjct: 1 MRLFAITCLLFSIVTVIGAEFSAEDCRELGLIKSQLFCSACSSLSDYGLIELKEHCLECC 180 Query: 72 QEEAQFETK-KLYAGAILEVCGXKLGRFPQVQAFVRSDKPKLFRGLQIKYVRGSDPVLKL 130 Q++ + ++K K+Y A+LEVC K G +PQ+QAF++SD+P F L IKYVRG DP++KL Sbjct: 181 QKDTEADSKLKVYPAAVLEVCTCKFGAYPQIQAFIKSDRPAKFPNLTIKYVRGLDPIVKL 360 Query: 131 LDDNGNIAEELSILKWNTDSVEEFLSEKLERI 162 +D+ G + E LSI KWNTD+V+EF +L ++ Sbjct: 361 MDEQGTVKETLSINKWNTDTVQEFFETRLAKV 456
>gi|42765457|gb|AY440428.1| Armigeres subalbatus ASAP ID: 40327 selenoprotein mRNA Score = 132 bits (333), Expect = 3e-31 Identities = 68/157 (43%), Positives = 100/157 (63%), Gaps = 2/157 (1%) Frame = +3 Query: 8 CLVPAFGLRLLLATVLQAVSAFGAEFSSEACRELGFXXXXXXXXX-XXXGQFNLLQLDPD 66 C++ F L L+ V+ AEF+++
CRELGF G++ L +L Sbjct: 105 CIMNKFLLSLIPVLVI-VFQHTKAEFTTKDCRELGFIESQLFCSSCDTLGEYGLDELKDH 281 Query: 67 CRGCCQEEAQFETKKL-YAGAILEVCGXKLGRFPQVQAFVRSDKPKLFRGLQIKYVRGSD 125 CR CCQ++A+ K + Y A+LEVC K G +PQ+QAF++SD+P+ F L IKYVRG D Sbjct: 282 CRECCQKDAESSGKLMVYPKAVLEVCTCKFGVYPQIQAFIKSDRPQKFPNLTIKYVRGLD 461 Query: 126 PVLKLLDDNGNIAEELSILKWNTDSVEEFLSEKLERI 162 P++KL+D++GN+ E LSI KWNTD+V+EF +L ++ Sbjct: 462 PIVKLMDESGNVKETLSITKWNTDTVQEFFETRLTKV 572
>gi|42763702|gb|AY431560.1| Aedes aegypti ASAP ID: 35705 selenoprotein mRNA Score = 128 bits (321), Expect = 8e-30 Identities = 61/134 (45%), Positives = 89/134 (66%), Gaps = 2/134 (1%) Frame = +1 Query: 31 AEFSSEACRELGFXXXXXXXXX-XXXGQFNLLQLDPDCRGCCQEEAQFETKKL-YAGAIL 88 AEF+++ CR+LGF G++ L +L CR CCQ++ + K + Y A+L Sbjct: 196 AEFTAKDCRDLGFIKSQLYCSSCGTLGEYGLDELKDHCRECCQKDVESTGKLMVYPKAVL 375 Query: 89 EVCGXKLGRFPQVQAFVRSDKPKLFRGLQIKYVRGSDPVLKLLDDNGNIAEELSILKWNT 148 EVC K G +PQ+QAF++SD+P+ F L IKYVRG DP++KL+D+ GN+ E LSI KWNT Sbjct: 376 EVCTCKFGAYPQIQAFIKSDRPQKFPNLTIKYVRGLDPIVKLMDEAGNVKETLSITKWNT 555 Query: 149 DSVEEFLSEKLERI 162 D+V+EF +L ++ Sbjct: 556 DTVQEFFETRLTKV 597
>gi|48094317|ref|XM_394140.1| Apis mellifera similar to CG7484-PB (LOC410663), mRNA Score = 96.7 bits (239), Expect = 3e-20 Identities = 53/136 (38%), Positives = 77/136 (56%), Gaps = 2/136 (1%) Frame = +1 Query: 26 VSAFGAEFSSEACRELGFXXXXXXXXXXXXGQFNLLQLDPDCRGCCQEEAQFETK--KLY 83 V+ EFS++ C+ LGF + NLL C ++ K Y Sbjct: 22 VNIVSTEFSADDCKSLGF------------NKANLL---------CSTYDDYDASGLKRY 138 Query: 84 AGAILEVCGXKLGRFPQVQAFVRSDKPKLFRGLQIKYVRGSDPVLKLLDDNGNIAEELSI 143 A+LEVC K G +PQ+QAF++S++P ++ LQIKYVRG DP++KL D + + + L I Sbjct: 139 PRAVLEVCTCKFGAYPQIQAFIKSNRPNKYKNLQIKYVRGLDPIIKLFDADNKVEDILDI 318 Query: 144 LKWNTDSVEEFLSEKL 159 KW+TDSV+EFL+ L Sbjct: 319 HKWDTDSVDEFLATHL 366
SECIS elements
Multiple alignment
TR3
As in the GPx family, after the tblastn similarity search all three TR proteins were found to match the same DNA sequence in each species.
Consequently, we decided to perform a multiple sequence alignment followed by an unrooted N-J tree, as we did with the GPx family. In this case we included all human TR proteins and the protein we found for each species.
tblastn
As the previous figures show, all the proteins we have studied are closer to TR3 than to any other human TR selenoprotein. However, this does not mean that every protein found in the insect genomes shares the function of human TR3. Since we use TR3 for gene prediction and multiple alignment, only the TR3 tblastn results are shown.
tblastn
>XR_group8 type=chromosome; loc=XR_group8:1..9190824; ID=XR_group8; release=r1.03; species=dpse e-149 54%
>gnl|Dvir-agencourt-run1029-asm|contig_2338 e-147 53%
>gnl|Dmov-agencourt-run0811-asm|contig_5619 e-146 53%
>gnl|Dana-agencourt-040714-asm|contig_394 e-146 53%
>gnl|Dyak-washu-assembly_040407|Contig131.3 e-145 53%
>gnl|Dana-agencourt-040714-asm|contig_2492 e-142 53%
>XL_group3a type=chromosome; loc=XL_group3a:1..2686958; ID=XL_group3a; release=r1.03; species=dpse e-140 50%
>gnl|Dmov-agencourt-run0811-asm|contig_2707 e-139 53%
>gnl|Dvir-agencourt-run1029-asm|contig_3163 e-138 53%
>gnl|Dyak-washu-assembly_040407|Contig34.25 e-137 52%
>gnl|Dere-agencourt-run1028-asm|contig_5904 e-101 52%
>gnl|Dere-agencourt-run1028-asm|contig_278 1e-62 52%
>gnl|Dsim-washu-w501-asm|Contig11.144 1e-29 46%
>gnl|Dsim-washu-w501-asm|Contig2066.1 1e-19 50%
>gi|33089107|gb|AY329357.1| Apis mellifera ligustica thioredoxin reductase (Trxr-1) mRNA, e-153 54%
>gi|58380391|ref|XM_310514.2| Anopheles gambiae str. PEST ENSANGP00000017329 e-153 56%
>gi|1848293|gb|U88187.1|MDU88187 Musca domestica glutathione reductase family member mRNA, e-150 53%
>gi|27819972|gb|BT003266.1| Drosophila melanogaster LD21729 full insert cDNA e-147 53%
>gi|42764077|gb|AY431890.1| Aedes aegypti ASAP ID: 42805 thioredoxin reductase mRNA sequence e-131 55%
>gi|50897530|gb|AY625511.1| Glossina morsitans morsitans clone Gmm2366 putative thioredoxin e-101 50%
>gnl|Dana-agencourt-040714-asm|contig_1097 3e-30 43%
>gnl|Dvir-agencourt-run1029-asm|contig_992 5e-30 42%
>4_group4 type=chromosome; loc=4_group4:1..6604331;ID=4_group4;release=r1.03; species=dpse 6e-30 40%
>gi|24581818|ref|NM_135053.2| Drosophila melanogaster CG3887-PA (CG3887) mRNA, complete cds 2e-37 43%
>gi|42765836|gb|AY440807.1| Armigeres subalbatus ASAP ID: 42399 putative: selenoprotein T mRNA. 1e-36 44%
>gi|42761900|gb|AY433028.1| Aedes aegypti ASAP ID: 36405 putative: selenoprotein T mRNA. 7e-34 48%
>gi|58393304|ref|XM_319979.2| Anopheles gambiae str.PEST ENSANGP00000016703 8e-33 45%
Gene prediction
For the sequences on which exonerate could not be run, the tblastn outputs are shown:
gnl|Dyak-washu-assembly_040407|Contig131.3 (the contig is not correct, so exonerate cannot be run)
Length = 18324 Score = 518 bits (1333), Expect = e-145 Identities = 261/486 (53%), Positives = 333/486 (68%), Gaps = 2/486 (0%) Frame = -1 Query: 38 DYDXXXXXXXXXXXACAKEAAQLGRKVAVVDYVEPSPQGTRWGLGGTCVNVGCIPKKLMH 97 DYD ACAKEAA G +V DYV+P+P GT+WG+GGTCVNVGCIPKKLMH Sbjct: 14415 DYDLVVLGGGSAGLACAKEAAGCGARVLCFDYVKPTPVGTKWGIGGTCVNVGCIPKKLMH 14236 Query: 98 QAALLGGLIQDAPNYGWEVA-QPVPHDWRKMAEAVQNHVKSLNWGHRVQLQDRKVKYFNI 156 QA+LLG + +A YGW V Q + DWRK+ +VQNH+KS+NW RV L+D+KV+Y N Sbjct: 14235 QASLLGEAVHEAVAYGWNVDDQNLRPDWRKLVRSVQNHIKSVNWVTRVDLRDKKVEYVNS 14056 Query: 157 KASFVDEHTVCGVAKGGKEIL-LSADHIIIATGGRPRYPTHIEGALEYGITSDDIFWLKE 215 SF D HT+ VA G E ++++++++A GGRPRYP I GA+E GITSDDIF + Sbjct: 14055 MCSFRDSHTIEYVAMPGAENRQVTSEYVVVAVGGRPRYPD-IPGAVELGITSDDIFSYER 13879 Query:
216 SPGKTLVVGASYVAWECAGFLTGIGLDTTIMMRTSPLRGFDQQMSSMVIEHMASHGTRFL 275 PG+TLVVGA YV ECA FL G+G + T+M+R+ LRGFD+QMS ++ M G FL Sbjct: 13878 EPGRTLVVGAGYVGLECACFLKGLGYEPTVMVRSIVLRGFDRQMSELLAAMMTERGIPFL 13699 Query: 276 RGCAPSRVRRLPDGQLQVTWEDSTTGKEDTGTFDTVLWAIGRVPDTRSLNLEKAGVDTSP 335 P V R DG+L V + ++TT K+ + FDTVLWAIGR LNLE AGV T Sbjct: 13698 GTTIPKAVERQADGRLLVRYHNTTTQKDGSDVFDTVLWAIGRKGLIEDLNLEAAGVKTHD 13519 Query: 336 DTQKILVDSREATSVPHIYAIGDVVEGRPELTPTAIMAGRLLVQRLFGGSSDLMDYDNVP 395 D KI+VD EATSVPHI+A+GD++ GRPELTP AI++GRLL +RLF GS+ LMDY +V Sbjct: 13518 D--KIVVDGAEATSVPHIFAVGDIIYGRPELTPVAILSGRLLARRLFAGSTQLMDYADVA 13345 Query: 396 TTVFTPLEYGCVGLSEEEAVARHGQEHVEVYHAHYKPLEFTVAGRDASQCYVKMVCLREP 455 TTVFTPLEY CVG+SEE A+ G +++EV+H +YKP EF + + CY+K V Sbjct: 13344 TTVFTPLEYSCVGMSEETAIELRGADNIEVFHGYYKPTEFFIPQKSVRHCYLKAVAEVSG 13165 Query: 456 PQLVLGLHFLGPNAGEVTQGFALGIKCGASYAQVMRTVGIHPTCSEEVVKLRISKRSGLD 515 Q +LGLH++GP AGEV QGFA +K G + ++ TVGIHPT +EE +L I+KRSG D Sbjct: 13164 DQKILGLHYIGPVAGEVIQGFAAALKSGLTVKTLLNTVGIHPTTAEEFTRLSITKRSGRD 12985 Query: 516 PTVTGC 521 PT C Sbjct: 12984 PTPASC 12967
SECIS elements
gnl|Dvir-agencourt-run1029-asm|contig 2338: 297811 297912
ACCCUGCAGC UUGAAAA GUUGUCUGUCCAU AUGAU AAGACAGGUAGA AA GAAAGCUGGCGAAAC UUCGAUUGUCUU UGAA AACAC AUUUGCA CAAGGCUAUU
COVE score: 9.87 (Recommended threshold for COVE: 15)
Multiple alignment
If you want to see the ClustalW alignment, click here.
Unrooted N-J tree
4. Discussion
Computational analysis has emerged as a powerful tool for deciphering genome content, allowing new genes to be found. New programs for gene prediction, similarity search and related tasks are being developed and continuously improved, so that we can combine those best suited to our specific problem. Our project aimed to determine the presence or absence of selenoproteins in different insect genomes using the computational resources currently available.
After an exhaustive analysis with the methods mentioned above, we report evidence of candidate selenoproteins in insect genomes.
THE SPS2 FAMILY
THE H SELENOPROTEIN
THE R SELENOPROTEINS
THE T SELENOPROTEINS
THE GPX FAMILY
In this family, 21 Cys-homologous proteins were found in 15 different species, all of them related to human Gpx4 according to the phylogenetic analysis. The tblastn results were quite significant, and some were further confirmed by the exonerate analysis, in which the Cys residues matching the human Sec were placed in frame. The multiple alignment showed a high level of conservation, even for those sequences that could not be submitted to exonerate.
As mentioned in the Results section, although four of the genomic fragments gave no reliable results with tblastn and exonerate, we reconstructed the proteins from sequences in two different frames and submitted them to multiple alignment analysis. This revealed that these proteins are very well conserved among themselves and, more significantly, with respect to the other sequences. This suggests that we are dealing with four genes that need a frameshift to encode Cys-homologous proteins. We therefore speculate that these genes could use an alternative 3' splice site. Nevertheless, we could simply be facing a sequencing error.
Our results also revealed another significant fact. Most Drosophila species showed more than one Cys-homologous protein. Although all of these proteins are closer to Gpx4 than to any other human Gpx protein, we can speculate that the different Cys-homologous proteins found in each species perform different functions, given that they are at least encoded by different genomic sequences. Multiple sequence alignment followed by an unrooted N-J tree suggested that the different Cys-homologous proteins found in most Drosophila species arose from a duplication process.
This is supported by the fact that, in species with more than one homologous protein, each protein is located in a different cluster of the tree. D. ananassae appears to have undergone two duplication events, while species showing just one Cys-homologous protein appear to have undergone none, and they cluster together. Regardless of our results, we have to bear in mind that, of the genomes we worked with, only those of D. melanogaster and D. pseudoobscura are completely sequenced, so these results cannot be considered definitive. A duplication event can lead to two different functional proteins, as mentioned above, but not necessarily. A duplication can also yield a pseudogene, so without more exhaustive and even experimental analysis it is impossible to confirm whether our Cys-homologous candidates are functional. Only two SECIS structures were found by SeciSearch and, given the scores obtained, only the one corresponding to D. ananassae can be considered significant. The presence of a SECIS in Cys-homologous proteins is an unusual case that is discussed in more depth at the end of this section.
THE 15KDA SELENOPROTEIN
THE TR FAMILY
The candidate selenoproteins found in this project probably represent a small sample of all those present across different organisms, so further genomic analyses are needed to establish their real importance. Apart from the conclusions mentioned above, the unusual cases found raise some evolutionary questions. The most likely explanation is that proteins containing Cys residues underwent mutations that enabled their downstream 3' UTR to form a SECIS structure.
5. References
ARTICLES:
6. Acknowledgements
We gratefully thank all our lecturers for making the world of Bioinformatics more accessible to us, especially Charles Chapple for supervising our work and having the patience to bear with us. We are very thankful for the opportunity to discover this world, a world that we will be pleased not to visit again for the rest of our days. Thank you for making us work so hard, to the point that our girlfriends and boyfriends broke up with us for feeling abandoned. Thank you for making us smoke again, a habit that we had managed to give up but went back to during this project. We greatly thank the greasy sandwiches from "la cafeteria del hospital" for feeding us with food of high vitamin content, essential for the highly demanding brain work that we had to perform. Finally, we thank our families for their unconditional support.