New genes

Posted: Wed Apr 23, 2014 1:56 am
by drevie
How do we use the Gene Model Checker (GMC) to check new genes (not in D. mel)? When my students tried, GMC didn't like the name they used.

Re: New genes

Posted: Wed Apr 23, 2014 4:54 am
by wleung
For novel genes that are not present in D. melanogaster, your students should enter the species and the ortholog name that was used to construct the gene model in the "Ortholog in D. melanogaster" field of the Gene Model Checker configuration form. For example, the GEP has previously identified a novel gene in D. virilis called GEP001 that is not present in D. melanogaster. To verify this gene model, you would enter Dvir-GEP001-PA in the "Ortholog in D. melanogaster" field.
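As a rough illustration of how such a name is put together, here is a minimal Python sketch. It assumes the name is a four-letter species prefix, a gene identifier, and a "P" plus isoform letter joined by hyphens, a pattern inferred only from the examples in this thread (Dvir-GEP001-PA, Dbia-GEP001-PA). The function name and the pattern are hypothetical and are not the Gene Model Checker's own validation rule.

```python
import re

# Hypothetical pattern inferred from the example names in this thread;
# the Gene Model Checker may accept or reject names on different grounds.
NAME_PATTERN = re.compile(r"^D[a-z]{3}-[A-Za-z0-9]+-P[A-Z]+$")


def novel_gene_name(species_prefix: str, gene_id: str, isoform: str = "A") -> str:
    """Join a species prefix, gene ID, and isoform letter into one name."""
    name = f"{species_prefix}-{gene_id}-P{isoform}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"{name!r} does not look like a Dxxx-GENE-PA style name")
    return name


if __name__ == "__main__":
    # The D. virilis example from the post above.
    print(novel_gene_name("Dvir", "GEP001"))
```

The check only catches obvious formatting slips, such as a missing species prefix, before the name is typed into the form.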

The Gene Model Checker will issue a warning indicating that the ortholog cannot be found in D. melanogaster and that it cannot produce the dot plot or the protein alignment. However, the Gene Model Checker will still validate the proposed gene model using the checklist and produce the transcript and peptide sequences for the proposed gene model. You can then use the "Align two or more sequences" functionality in NCBI BLAST to compare the proposed model against the putative ortholog in the reference species.

Please include a description of the novel gene in the GEP Annotation Report and the evidence used to justify the presence of a novel gene (e.g. RNA-Seq coverage, BLASTP results against the RefSeq protein database).

Re: New genes

Posted: Thu Apr 24, 2014 3:26 pm
by drevie
I don't believe there is any ortholog, although I told them to do BLAST searches to see. It was predicted by some of the gene prediction programs, like Nscan. I assume that they can therefore just enter GEP001-PA and that it will be accepted? Or should we enter Dbii-GEP001-PA?

Re: New genes

Posted: Thu Apr 24, 2014 3:28 pm
by drevie
I believe there was also RNA-Seq evidence for it.

Re: New genes

Posted: Thu Apr 24, 2014 4:46 pm
by wleung
Yes, you can use "Dbia-GEP001-PA" as the gene name when verifying the gene model using the Gene Model Checker. However, in general, I would not annotate a feature as a gene unless the feature shows significant sequence similarity to another known gene in the NCBI nr protein database or contains a conserved domain. This is because the accuracy of most gene predictors is only between 30% and 50%. RNA-Seq coverage in a region could also correspond to features other than protein-coding genes (e.g. transposon fragments, non-coding RNA genes). In addition, we cannot reliably reconstruct the alternative splicing pattern of a gene using just the RNA-Seq data and the results from the gene predictors. Tools such as Cufflinks can use the RNA-Seq read coverage and TopHat junctions to predict different isoforms, but accurate reconstruction of transcripts from RNA-Seq data remains an active area of research. Please refer to the following manuscript by Steijger T. et al. for more information:

Steijger T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013 Dec;10(12):1177-84.