Changes to the GEP Web Framework - Spring 2018

Change log for the Genomics Education Partnership Web Framework
wleung
Posts: 185
Joined: Sun Feb 04, 2007 7:41 pm
Location: Washington University in St. Louis

Post by wleung » Sun Jan 14, 2018 7:45 am

Hello Everyone,

The GEP Web Framework has been reset for the 2018 Spring semester. The key changes to the GEP Web Framework are summarized below:

1. Annotation projects for Spring 2018
  • 50 D. eugracilis F and D elements annotation projects remaining from Fall 2017
  • 65 new annotation projects from the D. takahashii F element
    • Please give priority to the D. eugracilis projects
2. Transcription Start Sites (TSS) projects for Spring 2018
  • 74 projects remaining from the D. biarmipes D element
  • 61 new TSS projects from the D. elegans F element
    • Reconciled gene models available through the "Reconciled Gene Models" track on the GEP UCSC Genome Browser [Jan. 2015 (GEP/Dot) assembly].
Note that this aspect of the project lags behind; if your students can do TSS annotation for their D. eugracilis genes, or help us with this backlog, it will be much appreciated. TSS annotation can be a good project for students who have completed your class, and are interested in an independent study to do more.

3. Sequence improvement projects for Spring 2018
  • 82 sequence improvement projects from D. biarmipes, D. elegans, D. ficusphila, and D. eugracilis remaining from Fall 2017
  • 32 new sequence improvement projects from the D. takahashii F element
4. Synchronize GEP annotation resources to FlyBase release 6.19
  • Updated GEP web framework tools (e.g., Gene Model Checker, Gene Record Finder)
  • Updated FlyBase genes, exons, and CDS tracks for the D. melanogaster Genome Browser
  • Updated "D. mel Proteins", "D. mel Transcripts", and "genBlastG Genes" evidence tracks for the D. biarmipes, D. elegans, D. ficusphila, and D. eugracilis projects
5. Updates to curriculum materials
  • Revised instructions and screenshots based on the new FlyBase 2.0 web site
  • Revised curriculum based on FlyBase release 6.19 and NCBI BLAST+ 2.7.1
6. New GEP UCSC Genome Browsers evidence tracks

D. melanogaster
  • Whole genome multiple sequence alignments of 26 Drosophila species
  • Transcription factor binding site predictions from JASPAR
  • CAGE TSS clusters identified by Schor IE et al., 2017
  • Release 005 of the EPDnew promoter annotations
  • CAGEr analysis of Exo-seq data produced by Afik S et al., 2017
  • Conserved domain annotations from Pfam and UniProt produced by the UCSC Bioinformatics Group
D. pseudoobscura
  • CAGEr analysis of the CAGE data produced by modENCODE

Detailed description of changes
Below is a more detailed description of the changes that we have made for Spring 2018:

1. Annotation projects for Spring 2018

There are 50 D. eugracilis projects (1 from the F element, 49 from the D element) remaining from Fall 2017. These projects have the highest priority in Spring 2018 (particularly the projects with no submissions). If your students have previously worked on the D. eugracilis annotation projects in Fall 2017, please submit the completed projects at your earliest convenience.

We have also created a new set of 65 annotation projects from the D. takahashii F element [Jan. 2018 (GEP/Dot) assembly] for the Spring 2018 semester. D. takahashii is at a similar evolutionary distance from D. melanogaster as the other species we have analyzed for this study.

Preliminary analysis of the D. takahashii assembly produced by the modENCODE project [April 2013 (BCM-HGSC/Dtak_2.0) assembly] identified 67 scaffolds that contain putative orthologs of D. melanogaster F element genes. To facilitate the data analysis, the D. takahashii F element scaffolds were combined into a single sequence (DtakGB2_F) based on the placements of coding exons and on synteny with D. melanogaster and D. biarmipes. The F element scaffolds in the Dtak_2.0 assembly are separated by 100 bp gaps in the combined DtakGB2_F sequence, per the GenBank specification for gaps of unknown size.
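To make the gap convention concrete, here is a minimal Python sketch that joins scaffold sequences with runs of 100 N characters. It is only an illustration: the function name join_scaffolds and the toy sequences are hypothetical, it assumes the scaffolds have already been ordered and oriented, and it is not the procedure that was used to build DtakGB2_F.

```python
GAP_SIZE = 100  # gap length used for gaps of unknown size, as described above


def join_scaffolds(scaffolds, gap_size=GAP_SIZE):
    """Concatenate ordered scaffold sequences, separated by runs of N."""
    gap = "N" * gap_size
    return gap.join(scaffolds)


# Toy example with made-up sequences (not real D. takahashii data)
toy_scaffolds = ["ACGTACGT", "GGGCCC", "TTTAAA"]
combined = join_scaffolds(toy_scaffolds)
print(len(combined))  # 20 bases of sequence + 2 gaps of 100 N = 220
```

Because each gap is a placeholder of fixed length, coordinates in the combined sequence do not reflect the true distances between neighboring scaffolds.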

The total size of the combined DtakGB2_F sequence is approximately 2.8 Mb, which indicates that the D. takahashii F element has undergone an expansion compared to the D. melanogaster F element (apparently due to increases in the densities of retrotransposons and helitrons). This DtakGB2_F sequence was partitioned into 65 overlapping annotation projects. The expansion of the D. takahashii F element results in a concomitant increase in the size of the D. takahashii annotation projects, which have a median project size of 80 kb. Similar to the other Drosophila annotation projects, the estimated number of genes in the D. takahashii projects ranges from 1 to 5, with a median of 2 genes per project.
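For readers who want to picture what "overlapping" projects look like, the following Python sketch splits a sequence into fixed-size windows that share a fixed overlap. This is a simplified illustration only: the function name partition and the window and overlap values are hypothetical, and the actual project boundaries were not necessarily chosen this way.

```python
def partition(total_length, window, overlap):
    """Return 0-based, half-open (start, end) windows that overlap by `overlap` bases."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, total_length)
        windows.append((start, end))
        if end == total_length:
            break
        start += step
    return windows


# Toy numbers chosen for readability, not the real project sizes
toy_windows = partition(total_length=1000, window=400, overlap=100)
print(toy_windows)  # [(0, 400), (300, 700), (600, 1000)]
```

The overlap means that a gene located near the edge of one window is also fully contained in the neighboring window, provided the gene is shorter than the overlap.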

As a reminder, the D. takahashii annotation projects were derived from a draft assembly that has not been manually improved. Preliminary analysis identified consensus errors in 7 genes that interfered with the coding region annotations. These errors have been corrected prior to creating the D. takahashii projects.

However, there could be additional consensus errors in the D. takahashii F element that could impact gene annotations. Instructions on how to identify and document consensus errors are available through the "Sequence Updater User Guide" (available under "Help" -> "Documentations" -> "Web Framework"). Please include the evidence used to support the identification and correction of the putative consensus errors in the "Consensus sequence errors report form" section of the GEP Annotation Report.

The TSS section of the GEP Annotation Report is optional, so you can submit a project without TSS annotations. However, if time permits, we would like to encourage your students to annotate the TSS in their project after they have completed the coding region annotations.


2. TSS projects for Spring 2018

We are beginning to explore different types of motif discovery analyses that can be used to identify conserved regulatory motifs that enable the expression of F element genes within a heterochromatic environment. TSS annotations remain an essential part of the current GEP research project — and we would like to encourage you and your students to contribute to the TSS annotation projects.

In addition to the 74 TSS annotation projects from the D. biarmipes D element that were released in Fall 2017, we have added 61 TSS annotation projects from the D. elegans F element to the Project Management System. The reconciled gene models are available through the D. elegans [Jan. 2015 (GEP/Dot)] assembly on the GEP UCSC Genome Browser. We have two undergraduate students who will be working this Spring to reduce the backlog of projects that require TSS annotations. Depending on the progress of the TSS annotations, we might release an additional set of TSS projects for the D. elegans D element this Spring.

As a reminder, we have produced RNA Polymerase II (RNA PolII) ChIP-Seq data for D. biarmipes, D. elegans, and D. ficusphila. Regions that are significantly enriched in RNA PolII are identified by MACS2, and the evidence tracks are available under the "Expression and Regulation" section of the GEP UCSC Genome Browser (e.g., "RNA PolII Peaks", "RNA PolII Enrichment").

The TSS annotation protocol remains the same as in previous years. The PowerPoint presentation and the recordings of the GEP webinars on TSS annotation are available on the "September 2016 GEP Webinars" page on the GEP Private Wiki. You can access one of the webinar recordings directly at https://wustl.adobeconnect.com/p7cuk86mya8/. The TSS lecture, walkthrough, and workflow are available in the "Beyond Annotation" section of the GEP web site.

In addition to the TSS curriculum in the Beyond Annotation section, two TSS introductory modules are currently under development by Meg Laakso (Eastern University) and Jamie Sanford (Ohio Northern University). These TSS modules will be an extension of the "Understanding Eukaryotic Genes" curriculum, and they are designed for beginning students. The modules will provide additional explanations of the different TSS evidence tracks available on the GEP UCSC Genome Browser for D. melanogaster (e.g., RAMPAGE, RNA PolII X-ChIP-Seq, CAGE, whole genome multiple sequence alignments). The two TSS modules should be ready for initial testing later this Spring.


3. Sequence improvement projects for Spring 2018

There are 82 sequence improvement projects remaining from Fall 2017 (4 from D. biarmipes, 16 from D. elegans, 29 from D. ficusphila, and 33 from the D. eugracilis F element). Based on the efforts of GEP students and two Washington University students working this past Fall, gaps within these projects have either been resolved or been tagged as "doNotFinish" (when multiple attempts at PCR and sequencing have failed). Hence the primary focus for these sequence improvement projects is to resolve errors within mononucleotide runs.

We have created 32 projects from the D. takahashii F element to help assess the quality of the reference-guided D. takahashii F element scaffold (DtakGB2_F) prior to creating the annotation projects. The gaps in these projects (with the project prefix "DTAK9999") still need to be resolved. If you would like to incorporate a wet-lab component into your course, please work on resolving the gaps within the projects from the D. takahashii F element.

As we have mentioned in the "Annotation projects for Spring 2018" section, the D. takahashii F element has undergone an expansion due to an increase in transposon density compared to D. melanogaster. The higher repeat density increases the possibility of misassemblies, which leads to additional challenges in resolving gaps in these sequence improvement projects. In addition, please note that the 100 bp gaps in the assembly piece (i.e. the "read" that spans the entire sequence improvement project) correspond to gaps with unknown size.
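As a small illustration of the point about 100 bp gaps, the Python sketch below lists the runs of N in a sequence that are exactly 100 bases long, which are the ones that would stand for gaps of unknown size. The function name find_unknown_size_gaps and the toy sequence are hypothetical, and the sketch assumes the sequence is available as a plain string.

```python
import re


def find_unknown_size_gaps(sequence, gap_size=100):
    """Return 0-based, half-open (start, end) spans of N runs of exactly gap_size bases."""
    return [
        (match.start(), match.end())
        for match in re.finditer(r"N+", sequence.upper())
        if match.end() - match.start() == gap_size
    ]


# Toy sequence: one 100 bp placeholder gap and one shorter 12 bp gap
toy_sequence = "ACGT" + "N" * 100 + "GGCC" + "N" * 12 + "TTAA"
gaps = find_unknown_size_gaps(toy_sequence)
print(gaps)  # [(4, 104)]
```

Only the 100 bp run is reported; the 12 bp run is skipped because its length does not match the placeholder size.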

If you plan to install Consed on macOS, please note that Consed is incompatible with recent versions of X11 (XQuartz 2.7.10 or above). We have previously updated the "GEP Installation Package" page on the GEP Wiki with instructions on how to install an older version of XQuartz. There is also a temporary workaround if you cannot downgrade XQuartz (see the Consed "additional information after the 29.0 release" web page for details).


4. Synchronize GEP annotation resources to FlyBase release 6.19

FlyBase published a new annotation release for D. melanogaster (6.19; FB2017_06) on December 31, 2017. We have synchronized the GEP annotation tools to this annotation release for Spring 2018. The Gene Record Finder, the Gene Model Checker, the Annotation Files Merger, and the blastx reports in the annotation packages have been updated to FlyBase release 6.19. In addition, we have updated the protein alignments and gene prediction tracks (i.e., blastx protein alignments, SPALN transcript alignments, and genBlastG gene predictions tracks) on the GEP UCSC Genome Browser for the D. biarmipes, D. elegans, D. ficusphila, and D. eugracilis projects.

We have also updated the evidence tracks for the whole genome assemblies of the 26 Drosophila species available on the GEP UCSC Genome Browser so that they are consistent with FlyBase release 6.19. For D. melanogaster, we have updated the FlyBase Genes, Exons, and CDS tracks. For the other 25 Drosophila species, we have updated the BLAT alignments to D. melanogaster transcripts ("D. mel Transcripts"), the SPALN alignments to D. melanogaster proteins ("D. mel Proteins"), the genBlastG gene predictions ("genBlastG Genes"), and the tblastn alignments of individual coding exons ("CDS Mapping").


5. Updates to curriculum materials

In conjunction with the new annotation release, FlyBase has also redesigned their web site. The new FlyBase 2.0 web site is more mobile friendly, and it provides improvements to many FlyBase tools (e.g., HitLists, Protein Domains, Sequence Downloader). Please see the FlyBase 2.0 commentary and the video on the FlyBase YouTube channel for details.

As part of the transition to FlyBase 2.0, we encountered some stability issues with the new FlyBase web site (e.g., broken links, layout issues) that will hopefully be resolved during the next few weeks. Some pages (e.g., the FlyBase BLAST page) still need to be updated to the new layout. In addition, the new FlyBase web site was occasionally inaccessible (with a 504 Gateway Timeout error). If you cannot access the new FlyBase web site or have curriculum pieces that depend on the old FlyBase web site, you can access the previous version of the FlyBase web site at http://fb2017_05.flybase.org/.

While the overall organization of the FlyBase web site remains the same, the updated FlyBase web site necessitates revisions to most of the GEP annotation curriculum materials (e.g., updated screenshots, step-by-step instructions). The following curriculum pieces have undergone minor revisions to reflect changes to the FlyBase 2.0 web site, the new FlyBase annotation release, and changes to NCBI BLAST (NCBI BLAST+ 2.7.1) and the BLAST databases:
  • An Introduction to NCBI BLAST
  • Annotation for D. virilis
  • Annotation of a Drosophila Gene
  • Annotation of Conserved Motifs in Drosophila
  • Annotation of Drosophila (workshop presentation)
  • Annotation of Drosophila Primer
  • Annotation of Transcription Start Sites in Drosophila
  • Annotation Strategy Guide
  • Basics of BLAST
  • Behavior and Limitations of Motif Finding
  • Chimp BAC Analysis
  • Detecting and Interpreting Genetic Homology
  • GenBank Accession Number Reference Sheet
  • Introduction to ab initio and Evidence-based Gene Finding
  • Introduction to BLAST using Human Leptin
  • Introduction to web databases
  • List of Common Bioinformatics Programs
  • Overview of Multiple Sequence Alignment Algorithms
  • Quick check of student annotations
  • Searching for Transcription Start Sites in Drosophila
  • Simple Annotation Problem
  • Using mRNA and EST Evidence in Annotation
The following curriculum pieces have undergone major revisions due to changes to the FlyBase tools and the deprecation of old FlyBase tools:
  • Motif discovery in Drosophila
  • Using FlyBase RNA-Seq Tools to Investigate Gene Expression Profiles
If you have placed any of these documents on your own communication system for your students (e.g., Blackboard), please replace them with the updated versions.


6. New GEP UCSC Genome Browsers evidence tracks

In preparation for the comparative analysis of conserved regulatory motifs, we have added multiple evidence tracks to the GEP UCSC Genome Browser. To facilitate the phylogenetic footprinting analysis, we have produced a multiple sequence alignment of 26 Drosophila species, followed by analyses using phastCons and PhyloP to identify conserved regions. These evidence tracks are available through the "Drosophila Conservation (26 Species)" composite track on the GEP UCSC Genome Browser for D. melanogaster (under the "Comparative Genomics" section).

Multiple TSS resources for D. melanogaster were published during the past year. One of the Washington University students will evaluate these new evidence tracks as part of a research project that compares the promoter architecture of F and D element genes. The new evidence tracks are available under the "Expression and Regulation" section of the D. melanogaster genome browser:
  • JASPAR 2018 TFBS
    • Predicted binding sites for the transcription factors in the JASPAR CORE insect collection (Khan A et al., 2017; PMID: 29161433)
  • Schor 2017 CAGE Peaks (R5)
    • CAGE TSS clusters based on the analysis of CAGE datasets for 81 D. melanogaster DGRP lines (Schor IE et al., 2017; PMID: 28191888). The results were lifted from the release 5 to the release 6 assembly.
  • EPDnew Promoters
    • Updated D. melanogaster promoter annotations (release 005) produced by the Eukaryotic Promoter Database (EPDnew). The new annotations include the results from Schor et al., 2017.
  • Combined Exo-seq TSS, Exo-seq Read Density, Exo-seq Tag Clusters
    • CAGEr analysis of the Exo-seq datasets produced by Afik S et al., 2017 (PMID: 28335028). The Exo-seq datasets include experimental results from four Zeitgeber times (ZTs) at three temperatures (18 °C, 25 °C, 29 °C).
We have also imported two conserved domains evidence tracks produced by the UCSC Genome Bioinformatics Group to the GEP D. melanogaster genome browser (see the "Pfam in RefSeq" and "UniProt" evidence tracks under the "Genes and Gene Prediction Tracks" section).

In addition, we have re-analyzed the modENCODE D. pseudoobscura CAGE data (from female and male carcasses, female ovaries, and male testes) using CAGEr. The CAGEr results are available through the D. pseudoobscura BCM-HGSC Dpse_3.0/DpseGB3 assembly under the "Expression and Regulation" section.


We hope that you find the resources above useful. Please let me know if you encounter any issues accessing these resources or if you have any questions. Thanks.

Sincerely,

Wilson
