Annotation Questions

From GEP Wiki

This page contains a list of common annotation questions and their answers. If you have questions on annotation, please ask them on the GEP bulletin board.

How can I get information on my region of genomic DNA?

There are two different web pages that have information on your region of DNA. The first is the genome browser at the University of California, Santa Cruz, referred to below as the Santa Cruz Browser. This browser organizes the genomic sequence into the large contigs assembled by the genome centers that produced the data. The front page looks like this:

image:ucsc_splash.png

The second is the GEP browser (gander.wustl.edu), referred to below as the Gander Browser. It is a local copy of the same browser software in which each small project section is loaded as a separate piece, with several tracks not found on the Santa Cruz Browser that were selected specifically to support manual annotation. The intro page looks like this:

image:goose_splash.png

Each site has different data, so you may wish to visit both and investigate the various analysis tracks. While both sites have useful information, you will need to use the Gander Browser for your final annotation steps because it is the site with the correct base numbers for your project. Thus, when you are ready to search for exact splice sites, go to the Gander Browser.

There are four common reasons to use the genome browser at UC Santa Cruz for annotating GEP projects. First, you decide to merge neighboring projects so that you can work on a contiguous sequence as a class. Second, you suspect there is a partial gene at the beginning or the end of your contig sequence and wish to confirm that the missing exons can be found upstream or downstream of your project sequence. Third, you wish to examine chromosomes of D. erecta other than the dot chromosome. Fourth, you want to find corresponding regions in other Drosophila species.

How do I transform coordinates from the Santa Cruz Browser to the Gander Browser?

The easiest way to transform the coordinates is to use the project report file that is part of the annotation package you have previously downloaded. This project report file contains the coordinates of the D. erecta scaffold that correspond to the coordinates of the individual projects.

Another way to transform the coordinates is to use BLAT to search a D. erecta project sequence against the D. erecta assembly at UC Santa Cruz:

  1. Navigate to the project contig on the Gander Browser (gander.wustl.edu)
  2. Extract the first 10kb of the contig sequence (use the DNA tool)
  3. Navigate to the Santa Cruz Browser (genome.ucsc.edu)
  4. BLAT the extracted sequence against the D. erecta 2005 assembly
  5. There should be only one hit spanning all 10kb with 100% sequence identity on scaffold 4512.
  6. You can then determine the required offset from the scaffold coordinate that corresponds to position 1 in your sequence.
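
The offset arithmetic in step 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both browsers are taken to report 1-based coordinates, the BLAT hit is taken to be on the same strand as the project, and the function names and the example start coordinate are hypothetical rather than values from any real project.

```python
def project_to_scaffold(project_pos, scaffold_start):
    """Convert a 1-based project coordinate to a 1-based scaffold coordinate.

    scaffold_start is the scaffold coordinate that BLAT reports for
    position 1 of the project sequence (assumed same strand).
    """
    if project_pos < 1:
        raise ValueError("project coordinates start at 1")
    return scaffold_start + project_pos - 1


def scaffold_to_project(scaffold_pos, scaffold_start):
    """Convert a 1-based scaffold coordinate back to a project coordinate."""
    project_pos = scaffold_pos - scaffold_start + 1
    if project_pos < 1:
        raise ValueError("scaffold position lies upstream of the project")
    return project_pos


# Hypothetical value: suppose position 1 of the project aligns to
# scaffold coordinate 250001.
example_start = 250001
print(project_to_scaffold(5000, example_start))    # scaffold coordinate of project base 5000
print(scaffold_to_project(255000, example_start))  # and back again
```

If the BLAT hit were on the minus strand, the mapping would run in the opposite direction and this simple addition would not apply.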

How do I obtain small regions of sequence from larger clones for downstream analysis?

There are both web based and command line methods for extracting a region of sequence from within a larger region.

Method 1 using Santa Cruz browser:

If the sequence you are working with is on a Santa Cruz-style browser, you can use the browser to extract the sequence. To do this, go to the browser (either genome.ucsc.edu or gander.wustl.edu). Click “genome browser” on the top left. Select the proper clade, species and assembly for the clone from which you wish to extract sequence. Enter the name of the sequence in the position box, followed by a colon and the coordinates of the bases you want. It might look something like this: Fosmid2:5000-9000 or chr3:123,456,789-123,456,999. Click submit. If the region you see on the browser is the correct region, click DNA at the top of the page. If it is not the exact region, use the standard navigation controls until the browser displays the region you want to extract, and then click the DNA button at the top.
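
As an illustration of the position format described above (sequence name, colon, start, dash, end, with optional commas in the numbers), here is a small Python sketch that splits such a string into its parts. The function name and the regular expression are assumptions made for this example; they are not part of the browser itself.

```python
import re

# name, colon, start, dash, end; digits may contain thousands separators
_POSITION = re.compile(r"^\s*([^:\s]+):([\d,]+)-([\d,]+)\s*$")


def parse_position(text):
    """Split a position string such as 'Fosmid2:5000-9000' into
    (name, start, end), with start and end as integers."""
    match = _POSITION.match(text)
    if match is None:
        raise ValueError(f"not a valid position: {text!r}")
    name = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"coordinates out of order in {text!r}")
    return name, start, end


print(parse_position("Fosmid2:5000-9000"))
print(parse_position("chr3:123,456,789-123,456,999"))
```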

Once you hit the DNA button you will be taken to a page that will allow you to extract the sequence. At the top of the page is the “position” box; this should match the position you were just viewing on the browser. You can change the value here if it is incorrect. Use the entries below the position box to modify the sequence you wish to extract. Be aware that the browser's default is to extract the sequence of the top strand (as shown in the browser). If you wish to obtain the sequence of the bottom strand, be sure to check the “Reverse complement (get '-' strand sequence)” box. There is also a detailed set of controls for adding color to the bases if desired.

Once the settings are the way you want them, hit the "Get sequence" button. The next page will give you the sequence section that you asked for. From here you can select the sequence and paste it into any sequence analysis web page or save it to a file.
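
If you already have the larger sequence as plain text, the same operation (take a region, optionally as the bottom strand) can be sketched in Python. This is a minimal illustration, not the browser's own code; it assumes 1-based inclusive coordinates as used in the position box, handles only the bases A, C, G, T and N, and uses a hypothetical function name and a toy sequence.

```python
# Complement table for upper- and lower-case bases
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(sequence, start, end, reverse_complement=False):
    """Return bases start..end (1-based, inclusive) of sequence.

    With reverse_complement=True, return the bottom ('-') strand
    for the same region instead of the top strand.
    """
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("region lies outside the sequence")
    region = sequence[start - 1:end]
    if reverse_complement:
        region = region.translate(_COMPLEMENT)[::-1]
    return region


toy = "AACCGGTTAC"  # made-up sequence for illustration
print(extract_region(toy, 1, 4))                           # AACC
print(extract_region(toy, 1, 4, reverse_complement=True))  # GGTT
```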

Method 2 using EMBOSS:

The EMBOSS package has a sequence extraction routine called extractseq. The EMBOSS package can be installed on your computer to allow you to run this routine at the command line. There are also web-based interfaces to the EMBOSS package. You can find a list here.
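
For a local EMBOSS installation, a call to extractseq might be scripted from Python as sketched below. The -sequence, -regions and -outseq qualifiers are the ones given in the EMBOSS documentation for extractseq, but check them against your installed version. The file names and helper function names are hypothetical.

```python
import subprocess


def build_extractseq_command(input_path, start, end, output_path):
    """Build the argument list for an extractseq call that writes
    bases start..end of input_path to output_path."""
    return [
        "extractseq",
        "-sequence", input_path,        # sequence file to read
        "-regions", f"{start}-{end}",   # region to keep
        "-outseq", output_path,         # file to write
    ]


def run_extractseq(input_path, start, end, output_path):
    """Run extractseq; requires EMBOSS to be installed and on the PATH."""
    command = build_extractseq_command(input_path, start, end, output_path)
    subprocess.run(command, check=True)


# Hypothetical file names; this only prints the command that would be run.
print(" ".join(build_extractseq_command("Fosmid2.fasta", 5000, 9000, "region.fasta")))
```

Keeping the command construction separate from the call that runs it makes it easy to inspect the command before anything is executed.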