
query regarding annotation submission report

Posted: Thu Apr 04, 2013 3:59 pm
by pgupta
Hi

For the last section of the annotation report, we perform a BLASTP search of the predicted amino acid sequence against the non-redundant protein database (nr) for regions of the project with gene predictions that do not overlap with putative orthologs identified in the BLASTX track. We get a hit with an E-value of 1e-28. The gene that comes up in the BLAST result belongs to D. ananassae, but the blastx alignment on the browser does not list such a gene. The fosmid we are working on is D. ananassae Jan. 2013 (GEP/3L Reference) fosmid_2728H07, and we blasted the first gene predicted by the GENSCAN gene prediction software.

Since it is a D. ananassae gene that comes up in the hit, and we are blasting the protein sequence from a fosmid belonging to the same species, shouldn't the E-value be zero, as it should be a perfect match? If this is not an adequate explanation, then how do we justify such a hit?

Thanks
Paromita

Re: query regarding annotation submission report

Posted: Thu Apr 04, 2013 7:14 pm
by wleung
As part of the initial analysis of the 12 Drosophila genomes, the Drosophila 12 Genomes Consortium generated a set of gene predictions (GLEAN-R) for all the Drosophila species. These predictions are in the NCBI RefSeq database with the prefix XM_ (mRNA) or XP_ (protein). The XM and XP prefixes indicate that these sequences are computational predictions that have not been experimentally confirmed. In contrast, the D. melanogaster RefSeq sequences have the prefix NM_ (mRNA) or NP_ (protein).

Consequently, it is not surprising for other gene predictors (e.g. GENSCAN) to predict the same open reading frames when they analyze the same region of the genome. Given the low accuracy of gene predictors, unless the prediction is supported by RNA-seq evidence or is conserved in other species, we cannot infer the correct gene structure or the different isoforms that might be present. If there is no other evidence that supports this gene model, I would not annotate the feature as a gene. However, please include a screenshot of the blastp alignment to the D. ananassae GLEAN-R prediction in the annotation report.
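
If it helps when reading through a list of BLAST hits, the RefSeq prefix convention described above can be checked with a few lines of Python. This is only a rough sketch: the function name and the "predicted"/"curated" labels are my own shorthand, it covers only the four prefixes mentioned in this post, and the accessions in the usage line are made-up placeholders rather than real records.

```python
# Prefix meanings as described in the post above (only these four are covered).
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "XM_": ("mRNA", "predicted"),
    "XP_": ("protein", "predicted"),
}


def classify_refseq(accession):
    """Return (molecule, status) for a RefSeq accession.

    Returns None when the prefix is not one of the four listed above.
    """
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Placeholder accessions for illustration only; they are not real records.
for acc in ("XP_0000000.1", "NM_0000000.1", "ABC12345.1"):
    print(acc, classify_refseq(acc))
```

A hit whose accession starts with XP_ would therefore be a computational prediction, which is the case for the D. ananassae GLEAN-R models.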


> Since it is a D. ananassae gene that comes up in the hit and we are blasting the protein sequence from a fosmid belonging to same species
> shouldn't the e value be zero as it should be a perfect match?

While we expect the sequence identity to be 100%, the E-value might be higher than 0. The E-value depends on the length of the alignment and the degree of sequence similarity. Short alignments are more likely to occur by chance, so they will have higher E-values. In addition, if the GLEAN-R gene model is derived from another part of the genome but has only weak similarity to the gene prediction in your fosmid, then the E-value would be substantially higher and the level of sequence identity would be substantially lower.
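
To make the dependence on alignment length more concrete, here is a small sketch based on the standard Karlin-Altschul relationship E = m * n * 2^(-S'), where S' is the bit score of the alignment, m is the query length and n is the total length of the database. The numbers below are hypothetical values chosen only for illustration (they are not taken from your fosmid or from nr), and the sketch ignores the effective-length corrections that BLAST applies in practice.

```python
def evalue_from_bitscore(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search space: a 60-residue query against a database
# of 5 billion residues (illustrative numbers only).
m = 60
n = 5_000_000_000

# A short alignment accumulates a lower bit score than a long one,
# even when both are 100% identical, so its E-value is higher.
for bits in (50, 120, 400):
    print(f"bit score {bits:>4}: E = {evalue_from_bitscore(bits, m, n):.3g}")
```

Because each additional bit of score halves the E-value, a short perfect match can still have an E-value well above zero, whereas a long perfect match usually scores high enough that the E-value is reported as 0.0.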