Chimp Chunk Tutorial

From GEP Wiki

Annotation Curriculum: Chimp Chunks practical considerations

(Faculty/TA only page)


The first two steps in annotating any primate genomic sequence are to run RepeatMasker and to analyze the masked sequence with a de novo gene finder (we use Genscan). We have found that, even on a fairly powerful server, the RepeatMasker analysis can take more than an hour of class time for a class of 12. To avoid wasting valuable in-class time, we prepare folders in which selected segments have already been masked and analyzed by Genscan. This allows the students to begin their analysis right away, while the instructors and the TA are available. See below for links to prepared packages.

Although the students do not know it, each segment is chosen to contain at least one pseudogene and one fairly well-annotated gene. Other interesting features also appear in many of the segments.

Students are graded less on how closely they match the current official annotation than on their ability to analyze their segment and make arguments that support their own conclusions. However, the current official annotation of each segment is available on the Santa Cruz genome browser and can certainly assist graders.

Generating New Packages

If you want to avoid using the same packages year after year, new packages are quite easy to generate. You can always use the genome browser and simply look around for interesting regions. If you want a pseudogene in your segment, the easiest approach is to download the list of chimpanzee pseudogenes from the pseudogene dataset's web page, or to use that site's search function to find regions with known pseudogenes. The dataset lists pseudogenes by chromosome number, position, and type (processed or duplicated) and includes other data helpful for cross-referencing. The coordinates can be entered into your genome browser to view the region of interest; be sure the assembly versions match at the two web pages, because chimpanzee chromosomes have been renumbered between recent assembly versions. Examine the region around the pseudogene to see whether it has enough other features to make it suitable. If so, you can use the browser to extract the genomic sequence. The sequence can then either be given to the students for masking and analysis, or the complete package of files can be created before distribution. If you find other regions that work well, we encourage you to post them below for others to use.

We find “processed” pseudogenes (as indicated in the pseudogene dataset) to be easier to identify than “duplicated” pseudogenes and only include processed ones in our packages.
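For instructors who prefer to screen the downloaded pseudogene list with a script rather than by eye, the following is a minimal Python sketch of the selection step described above. It keeps only the processed pseudogenes and prints a position string, with some flanking sequence on each side, that can be pasted into the genome browser. The tab-delimited layout, the column names (chrom, start, end, type), the 20 kb flank, and the example rows are all assumptions made for illustration; the real dataset's columns will differ and should be checked before use.

```python
import csv
import io


def processed_pseudogene_positions(table_text, flank=20000):
    """Return browser position strings for processed pseudogenes.

    Assumes a tab-delimited table with a header row containing the
    hypothetical columns: chrom, start, end, type.
    """
    positions = []
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        # Skip duplicated pseudogenes; only processed ones are kept.
        if row["type"].strip().lower() != "processed":
            continue
        # Widen the window so neighboring features are visible,
        # without letting the start fall below position 1.
        start = max(1, int(row["start"]) - flank)
        end = int(row["end"]) + flank
        positions.append(f"chr{row['chrom']}:{start}-{end}")
    return positions


# Made-up rows for illustration only; these are not real chimpanzee coordinates.
EXAMPLE = (
    "chrom\tstart\tend\ttype\n"
    "1\t150000\t151200\tProcessed\n"
    "7\t5000\t6100\tDuplicated\n"
    "X\t30000\t30900\tProcessed\n"
)

if __name__ == "__main__":
    for position in processed_pseudogene_positions(EXAMPLE):
        print(position)
```

The script only narrows the list of candidates. Each region still needs to be viewed in the browser, on the matching assembly version, to confirm that it has enough other features to make a good package.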

Available Packages

Packages with posted answers

There are three packages which include both the original material and a set of "answers". They are available on the main GEP web site under curriculum materials.

They are not posted here to ensure that students cannot find the "answers" by searching the web. 

Packages with no posted answers

The following packages can also be used. These packages come with only the starting material and have been used at Washington University; however, we have not posted any student-generated analysis of these packages: