Changes to the GEP Web Framework - Fall 2017

Change log for the Genomics Education Partnership Web Framework
wleung
Location: Washington University in St. Louis

Post by wleung » Wed Aug 30, 2017 6:01 am

Hello Everyone,

The GEP Web Framework has been reset for the 2017 Fall semester. The key changes to the GEP Web Framework are summarized below:

1. Annotation projects for Fall 2017
  • 3 D. ficusphila Muller D element projects and 2 D. eugracilis F element projects remaining from Spring 2017
  • 75 new annotation projects from the D. eugracilis D element
  • Revised GEP Annotation Report Form based on feedback from GEP Alumni Workshops
2. Transcription Start Sites (TSS) projects for Fall 2017
  • 74 new TSS projects from the D. biarmipes D element
    • Reconciled gene models available through the "Reconciled Gene Models" track on the GEP UCSC Genome Browser [Jan. 2014 (GEP/3L Control) assembly].
  • New RNA PolII ChIP-Seq data for D. elegans and D. ficusphila
  • Revised TSS Report Form based on feedback from GEP Alumni Workshops
3. Sequence improvement projects for Fall 2017
  • 68 sequence improvement projects from D. biarmipes, D. elegans, D. ficusphila, and D. eugracilis remaining from Spring 2017
  • 21 new sequence improvement projects from the D. eugracilis D element
4. Synchronize GEP annotation resources to FlyBase release 6.16
  • Updated GEP web framework tools (e.g., Gene Model Checker, Gene Record Finder)
  • Updated FlyBase genes, exons, and CDS tracks for the D. melanogaster Genome Browser
  • Updated "D. mel Proteins", "D. mel Transcripts", and "genBlastG Genes" tracks on the GEP UCSC Genome Browser for the D. biarmipes, D. elegans, D. ficusphila, and D. eugracilis projects.
5. Updates to curriculum materials
  • New RNA-Seq lectures and video; walkthrough on using the FlyBase RNA-Seq tools
  • GEP YouTube channel at http://gep.wustl.edu/youtube
  • Revised curriculum based on FlyBase release 6.16 and NCBI BLAST+ 2.7.0
6. New GEP UCSC Genome Browsers

Below is a more detailed description of the changes that we have made for Fall 2017:

1. Annotation projects for Fall 2017
There are 3 D. ficusphila D element projects and 2 D. eugracilis F element projects remaining from Spring 2017. These projects have the highest priority in Fall 2017.

Because most of the D. ficusphila and D. eugracilis projects have at least two submissions, they were no longer available to be claimed after we reset the Project Management System. If your students completed annotation projects in Fall 2016 or Spring 2017 that you would like to submit, please send me an email with the list of projects. We will then add these projects to your account so that you can submit them through the Project Management System.

We have also created a new set of 75 projects from the D. eugracilis D element [Aug. 2017 (GEP/3L Control)] for Fall 2017. These projects are derived from two scaffolds (KB465075 and KB465337) that have been putatively assigned to the base of the D. eugracilis D element.

As a reminder, the D. eugracilis annotation projects were derived from a draft assembly that has not been manually improved. Previous analyses of the D. eugracilis F element scaffolds identified consensus errors in 22 genes (26 coding exons) that interfered with the annotation of the coding regions (e.g., frame shifts within coding exons, incompatible splice sites). For the D. eugracilis D element projects, the automated analysis pipeline identified 3 genes with consensus errors. We corrected these errors prior to creating the D. eugracilis annotation projects. However, there could be additional consensus errors in the D. eugracilis D element that affect the gene annotations.
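To illustrate why a consensus error can interfere with the annotation of a coding exon, here is a minimal Python sketch. It is not part of the GEP tools; the sequences and the function name `translate` are made up for illustration. It shows how a single extra base in a mononucleotide run changes every downstream codon and removes the stop codon.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Hypothetical coding sequence (not taken from any GEP project).
correct = "ATGGCTGAAACTTGGAAATAA"
# Same sequence with one extra A in the "AAA" run, mimicking a consensus error.
with_error = correct[:9] + "A" + correct[9:]

print(translate(correct))     # MAETWK*
print(translate(with_error))  # MAENLEI (frame shifted, stop codon lost)
```

An insertion or deletion whose length is not a multiple of three shifts the frame in this way, which is why the evidence for each putative consensus error should be documented in the report.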

Instructions on how to identify and document consensus errors are available through the "Sequence Updater User Guide" (available under "Help" -> "Documentations" -> "Web Framework"). Please include the evidence used to support the identification and correction of the putative consensus errors in the "Consensus sequence errors report form" section of the GEP Annotation Report.

The TSS section of the GEP Annotation Report is optional, so you can submit a project without TSS annotations. However, if time permits, we would like to encourage your students to annotate the TSS in their project after they have completed the annotation of the coding regions.

We have revised the GEP Annotation Report Form based on the feedback from the June and July GEP Alumni Workshop working groups. In addition to minor revisions to the text, we have changed the layout settings of the Word document to mitigate the layout issues that some students have encountered when they insert screenshots into the GEP Annotation Report. Inserting the screenshots into the colored boxes in the revised GEP Annotation Report Form will constrain the position and the size of each image, thereby avoiding unexpected changes in the placement of the images and surrounding text.


2. TSS projects for Fall 2017

TSS annotations remain an essential part of the current GEP research project, and we would like to encourage you and your students to contribute to the TSS annotation projects. Because many classes have not had time to complete the TSS annotations of the genes they otherwise annotated, we have a large number of projects with reconciled coding region annotations that still require TSS annotations. Based on the efforts of GEP students and two Washington University students working this summer, we have completed the TSS annotation of the D. biarmipes F element. We have begun preliminary analyses using MEME and Tomtom to identify motifs associated with D. melanogaster and D. biarmipes F element genes, and will continue with these analyses this Fall.

For Fall 2017, we will begin the TSS annotation of 74 projects from the D. biarmipes D element. The reconciled gene models are available through the "Reconciled Gene Models" evidence track on the GEP UCSC Genome Browser [D. biarmipes Jan. 2014 (GEP/3L Control) assembly]. Depending on the progress of the TSS annotations for the D. biarmipes D element, we might release additional TSS projects from D. elegans and D. ficusphila this Fall.

We have previously produced RNA Polymerase II (RNA PolII) ChIP-Seq data for D. biarmipes. To facilitate the TSS annotations of the other Drosophila species, we have produced RNA PolII ChIP-Seq data for D. elegans and D. ficusphila this summer. For these two species, there are RNA PolII ChIP-Seq data from two biological replicates, and each biological replicate has two technical replicates. Regions that are significantly enriched in RNA PolII were identified by MACS2. The RNA PolII evidence tracks are available under the "Expression and Regulation" section of the GEP UCSC Genome Browser (e.g., "RNA PolII Peaks", "RNA PolII Enrichment").

Based on the feedback from the summer 2017 Alumni Workshops, we have clarified several items in the TSS Report Form (e.g., evidence that supports the TSS annotation). Similar to the GEP Annotation Report Form, we have changed the layout of the TSS Report Form and added boxes to the Word document where students can insert screenshots. Inserting images into these boxes will reduce unexpected changes in the placement of images and text.

The TSS annotation protocol remains the same as that used during Fall 2016 and Spring 2017. The PowerPoint presentation and the recordings of the GEP webinars on TSS annotation are available on the "September 2016 GEP Webinars" page on the GEP Private Wiki. You can access one of the webinar recordings directly at https://wustl.adobeconnect.com/p7cuk86mya8/. The TSS lecture, walkthrough, and workflow are available on the "Beyond Annotation" section of the GEP web site.


3. Sequence improvement projects for Fall 2017

There are 68 sequence improvement projects remaining from Spring 2017 (4 from D. biarmipes, 17 from D. elegans, 35 from D. ficusphila, and 12 from the D. eugracilis F element). Based on the efforts of the GEP students and a Washington University student working this summer, gaps within these projects have either been resolved or been tagged as "doNotFinish" (when multiple attempts at PCR and sequencing have failed). Hence, the primary focus for these sequence improvement projects is to resolve errors within mononucleotide runs. Help with these projects would be appreciated.
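For those who would like a quick way to list candidate mononucleotide runs in a consensus sequence before opening it in Consed, here is a minimal Python sketch. It is not part of Consed or the GEP tools; the function name `find_mononucleotide_runs` and the default threshold of 6 bases are assumptions chosen for illustration.

```python
import re


def find_mononucleotide_runs(sequence, min_length=6):
    """Return (start, end, base) for each run of one base repeated at least
    min_length times. Coordinates are 0-based and end-exclusive."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # One base followed by at least (min_length - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(sequence.upper())]


# Hypothetical consensus fragment: a run of 7 A's and a run of 6 T's.
example = "ACGT" + "A" * 7 + "GC" + "T" * 6 + "C"
for start, end, base in find_mononucleotide_runs(example):
    print(f"{base} x {end - start} at {start}-{end}")
```

The reported regions are only candidates; whether a run actually contains an error still has to be judged from the underlying reads.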

We have created 21 projects from the D. eugracilis D element to help assess the quality of the genome assembly prior to creating the annotation projects. The gaps in these projects (with the project prefixes "DEUG5075" and "DEUG5337") still need to be resolved. If you would like to incorporate a wet-lab component into your course, please work on resolving the gaps within the projects from the D. eugracilis D element.

If you plan to install Consed on macOS, please note that Consed is incompatible with recent versions of X11 (XQuartz 2.7.10 or above). We have previously updated the "GEP Installation Package" page on the GEP Wiki with instructions on how to install an older version of XQuartz. There is also a temporary workaround if you cannot downgrade XQuartz (see the "Additional information after the 29.0 release was completed" page on the Consed web site for details).


4. Synchronize GEP annotation resources to FlyBase release 6.16

The Gene Record Finder, the Gene Model Checker, the Annotation Files Merger, and the blastx reports in the annotation packages have been updated to FlyBase release 6.16. We have also updated the protein alignments and gene prediction tracks (i.e., blastx protein alignments, SPALN transcript alignments, and genBlastG gene predictions) on the GEP UCSC Genome Browser for the D. biarmipes, D. elegans, D. ficusphila, and D. eugracilis projects.

On August 22, 2017, FlyBase released version 6.17 of the D. melanogaster gene annotations, which contains a total of 13,932 protein-coding genes (30,494 isoforms). Compared to release 6.16, this new release added 13 gene models, deleted 2 gene models, and changed the names of 19 protein-coding genes. A preliminary analysis indicates that these changes did not affect the gene records on the D. melanogaster F element or the base of the D element. Hence, we have decided to use FlyBase release 6.16 for the Fall semester. Please let us know if you encounter any major discrepancies with the D. melanogaster gene annotations or the curriculum materials that are caused by the new FlyBase release.
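As an illustration of the kind of comparison described above (added, deleted, and renamed genes between two releases), here is a minimal Python sketch. It is not the analysis we actually ran; the function name `compare_releases` and the gene identifiers and symbols below are hypothetical placeholders rather than real FlyBase records.

```python
def compare_releases(old, new):
    """Compare two releases, each given as a dict of gene ID -> gene symbol.

    Returns three sorted lists of gene IDs: added, deleted, and renamed."""
    old_ids, new_ids = set(old), set(new)
    added = sorted(new_ids - old_ids)
    deleted = sorted(old_ids - new_ids)
    # A gene counts as renamed if the ID is kept but the symbol differs.
    renamed = sorted(g for g in old_ids & new_ids if old[g] != new[g])
    return added, deleted, renamed


# Hypothetical toy data (placeholder IDs and symbols).
older_release = {"G1": "alpha", "G2": "beta", "G3": "gamma"}
newer_release = {"G1": "alpha", "G3": "gamma-new", "G4": "delta"}

added, deleted, renamed = compare_releases(older_release, newer_release)
print("added:", added)
print("deleted:", deleted)
print("renamed:", renamed)
```

Restricting both dictionaries to the genes in a region of interest (for example, the F element) before calling the function would show whether any of the changes fall within that region.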


5. Updates to curriculum materials

As part of the professional development sessions for the GEP Alumni Workshops, we developed two lectures and a walkthrough on RNA-Seq. Dr. Leocadia Paliulis (Bucknell University) has created a short video that describes the RNA-Seq read mapping algorithm. The RNA-Seq curriculum is available through the "Beyond Annotation" section of the GEP web site:
  • RNA-Seq: a Closer Look at Read Mapping
  • RNA Quantitation from RNA-Seq Data
  • Using FlyBase RNA-Seq Tools to Investigate Gene Expression Profiles
The following curriculum materials have either been developed or revised by GEP faculty members:
  • Understanding Eukaryotic Genes (Meg Laakso et al., Eastern University)
  • Generating Multiple Sequence Alignments with Clustal Omega (Susan Parrish, McDaniel College)
  • Using Consed Graphically (Emily Furbee, Washington and Jefferson College)
The following curriculum materials have been revised for Fall 2017 based on feedback from GEP faculty, and to account for changes in FlyBase and NCBI:
  • An Introduction to NCBI BLAST
  • Annotation Strategy Guide
  • Annotation for D. virilis
  • Annotation of Conserved Motifs in Drosophila
  • Annotation of Drosophila (workshop presentation)
  • Annotation of Drosophila Primer
  • Annotation of Transcription Start Sites in Drosophila
  • Annotation of a Drosophila Gene
  • Behavior and Limitations of Motif Finding
  • Chimp BAC Analysis
  • Detecting and Interpreting Genetic Homology
  • GEP Annotation Report
  • GEP TSS Report
  • GenBank Accession Number Reference Sheet
  • Introduction to BLAST using Human Leptin
  • Introduction to ab initio and Evidence-based Gene Finding
  • Introduction to Web Databases
  • List of Common Bioinformatics Programs
  • Motif discovery in Drosophila
  • Searching for Transcription Start Sites in Drosophila
  • Simple Annotation Problem
  • Using mRNA and EST Evidence in Annotation
If you have placed any of these documents on your own communication system for your students (e.g., Blackboard), please replace them with the updated versions of the documents.

We have created a new YouTube channel for hosting GEP videos and playlists (available at http://gep.wustl.edu/youtube). The YouTube channel currently contains the playlists for the "Understanding Eukaryotic Genes" modules, the lectures on Hidden Markov Models, the Genome Center Video Tour, and the Next Generation Sequencing Video Tour. Please let us know if there are other videos that you would like to add to the GEP YouTube channel.


6. New GEP UCSC Genome Browsers

In collaboration with Dr. Nate Mortimer at Illinois State University, we are beginning a pilot project to identify and annotate genes involved in lipid synthesis in several parasitoid wasp species. Additional details for this project are available on the "Parasitoid Wasp Project" page on the GEP Private Wiki. To support this project, we have created a UCSC Assembly Hub for four wasp species (Nasonia vitripennis, Ganaspis species 1, Leptopilina boulardi, and Leptopilina heterotoma), with N. vitripennis serving as the informant species for the three other wasp species. The wasp genome browsers are available at http://gander.wustl.edu/wasps-browsers. These genome browsers include several types of evidence tracks: alignments to transcript and protein sequences from D. melanogaster and N. vitripennis, RNA-Seq read coverage and assembled transcripts, ab initio and evidence-based gene predictors (e.g., Geneid, N-SCAN, Augustus), and multiple sequence alignments of seven wasp species (Chain, Net, Multiz, phastCons, and phyloP). These genome browsers were created by the G-OnRamp project.

As part of the discussions during the GEP Alumni Workshops on how to expand the types of annotation projects offered by the GEP, several faculty members have expressed interest in the comparative analyses of other regions of the Drosophila genome, or of genes involved in a specific pathway. Unfortunately, FlyBase and the official UCSC Genome Browser only provide genome browsers for the 12 Drosophila genomes that were sequenced by the Drosophila 12 Genomes Consortium. In addition, with the exception of D. melanogaster, the genome assemblies on the official UCSC Genome Browser are based on older drafts of the assemblies [i.e., before Comparative Analysis Freeze 1 (CAF1)].

To address these limitations, the genome browsers for 26 Drosophila species are now available through the GEP UCSC Genome Browser. In addition to the D. melanogaster reference genome, the staff at the NCBI RefSeq database selected 25 species as representative genomes for the different clades in Drosophila. These representative genomes include the 11 Drosophila species sequenced by the Drosophila 12 Genomes Consortium, the 8 Drosophila species sequenced by modENCODE, and D. arizonae, D. busckii, D. miranda, D. navojoa, D. serrata, and D. suzukii.

NCBI has produced gene predictions for these genomes using the NCBI Eukaryotic Genome Annotation Pipeline. These gene models are based on evidence from RNA-Seq, protein sequence similarity, and results from the Gnomon gene predictor, and they include multiple isoforms and untranslated regions. The gene predictions are available through the "Gnomon Genes" track on the GEP UCSC Genome Browser. We have also added a search index so that you can search for individual gene predictions by name or by description.

To build the genome browsers, each Drosophila genome assembly was analyzed using RepeatScout to construct a species-specific de novo repeat library. Each species-specific repeat library was combined with the Drosophila repeats in the RepBase library, and the combined library was used with RepeatMasker to identify transposon remnants in each genome. We also used k-mer based (TRF, tantan, WindowMasker, Tallymer), structure-based (LTRharvest), and profile-based (TransposonPSI) repeat finders to identify repetitive sequences within each genome assembly.
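The step of combining a species-specific repeat library with an existing library can be sketched in Python as shown below. This is only an illustration under stated assumptions, not the script used to build the browsers. The function names (`parse_fasta`, `combine_repeat_libraries`) and the toy sequences are hypothetical, and the rule of keeping the first record when two libraries share a name is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def combine_repeat_libraries(*libraries):
    """Merge FASTA libraries, keeping the first record seen for each name."""
    seen, combined = set(), []
    for text in libraries:
        for header, sequence in parse_fasta(text):
            parts = header.split()
            name = parts[0] if parts else ""
            if name in seen:
                continue  # assumption: earlier libraries take precedence
            seen.add(name)
            combined.append((header, sequence))
    return combined


# Hypothetical toy libraries (not real RepeatScout or RepBase records).
de_novo = ">R=1 (de novo)\nACGTACGT\nACGT\n>R=2\nTTTTGGGG\n"
curated = ">ELEMENT_A\nGGGGCCCC\n>R=2\nAAAA\n"

combined = combine_repeat_libraries(de_novo, curated)
print("\n".join(f">{h}\n{s}" for h, s in combined))
```

A combined FASTA file like this is the kind of input that RepeatMasker accepts as a custom library through its `-lib` option; the exact options used for the GEP browsers are not stated here.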

The genome browsers include alignments of each assembly against release 6.16 of the D. melanogaster proteins (using SPALN), transcripts (using BLAT), and coding exons (using tblastn; see the "CDS Mapping" track). The genome browsers also include results from several ab initio and evidence-based gene predictors (genBlastG, GeMoMa, Genscan, Geneid, Augustus, SNAP, and GlimmerHMM), as well as tRNA and rRNA predictions from tRNAscan-SE and RNAmmer.

To assist in the identification of orthologous regions among these 26 Drosophila species, we have produced Chain and Net whole genome alignments of the 25 Drosophila genomes against release 6 of the D. melanogaster genome assembly [Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6)]. The alignments are available through the "Drosophila Chain/Net" composite track on the GEP UCSC Genome Browser for D. melanogaster (under "Comparative Genomics").

We are in the process of creating the whole genome multiple sequence alignments for these Drosophila species, followed by the phastCons and phyloP analyses to identify conserved regions. We also plan to incorporate a subset of the RNA-Seq data available for each species into the genome browsers. Please let me know if there is a specific RNA-Seq dataset, or an additional Drosophila species that you would like to have added to the GEP UCSC Genome Browser.


We hope that you find the resources above useful. Please let me know if you encounter any issues accessing these resources or if you have any questions. Thanks.

Sincerely,

Wilson
