Sequence Improvement Questions

From GEP Wiki

This page contains a list of common sequence improvement / finishing questions and their answers. If you have questions on finishing, please ask them on the GEP bulletin board.

How do I incorporate new data downloaded from GEP into my project?

No matter which technique you use to add data (see below), you will need to copy the trace files into the chromat_dir of the project. You can use any file copy technique suitable for your computer. To re-analyze the project and incorporate the new data into the assembly, you have two options: (1) use the phredphrap script to create a new assembly (ace file), or (2) use Consed's "Add New Reads" function to attempt to add the new reads to the current assembly.

How do I use phredphrap to incorporate new data into my project?

First copy the trace files into the chromat_dir found inside your project directory. Then start X11 and, in the Xterm window, use the cd command to set the working directory to the edit_dir of the project. Enter the command phredphrap (note that on some computers the command may actually be phredPhrap). Wait while phred/phrap and the other scripts reanalyze the project; you should see lots of text scroll by on the screen. Once phredphrap has completed, a new ace file will have been created. You can then start consed and open the new ace file to continue your analysis.

How do I use consed "Add New Reads" to incorporate new data into my project?

First copy the trace files into the chromat_dir found inside your project directory. Also copy the *.fof file into the edit_dir of the project. Then start X11 and, in the Xterm window, use the cd command to set the working directory to the edit_dir of the project. Start consed and open the most recent ace file. Click the "Add New Reads" button found on the main window. In the dialog box that appears, select the downloaded .fof file and click OK. Answer the two question boxes that follow appropriately (the default answer when working with GEP data sets is YES for both questions). Consed will now attempt to incorporate the new reads into the existing assembly. Looking back at the original Xterm window, you can see feedback on the ongoing assembly process. You will then be presented with two windows. The first is a navigation window listing all reads added and where they were placed. The second is a report of the newly placed reads.
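The file-copying step shared by both techniques can be sketched in Python. This is only an illustration, not part of the GEP or consed tooling: the function name stage_new_data is hypothetical, and the sketch assumes the download folder contains nothing but trace files and, optionally, a .fof file. Only the chromat_dir and edit_dir directory names come from the instructions above.

```python
import shutil
from pathlib import Path


def stage_new_data(download_dir, project_dir):
    """Copy downloaded files into a project (hypothetical helper).

    Assumption: every file in download_dir is either a trace file or a
    .fof file. Files ending in .fof go to edit_dir; everything else is
    treated as a trace file and goes to chromat_dir.
    """
    download_dir = Path(download_dir)
    project_dir = Path(project_dir)
    targets = {
        "chromat_dir": project_dir / "chromat_dir",
        "edit_dir": project_dir / "edit_dir",
    }
    for path in targets.values():
        if not path.is_dir():
            raise FileNotFoundError(f"missing project directory: {path}")

    copied = {"chromat_dir": [], "edit_dir": []}
    for src in sorted(download_dir.iterdir()):
        if not src.is_file():
            continue  # skip sub-folders
        key = "edit_dir" if src.suffix == ".fof" else "chromat_dir"
        shutil.copy2(src, targets[key] / src.name)
        copied[key].append(src.name)
    return copied
```

After staging the files this way, you would still run phredphrap or use "Add New Reads" from the edit_dir as described above; the sketch does not start either program.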

When incorporating new data, how should I choose between the phredphrap and the consed "Add New Reads" techniques?

When running phredphrap, all reads will be assembled de novo. This is usually fine to do and is the recommended technique. However, if there is a problem in the project that creates a mis-assembly, this mis-assembly will get re-created every time you run phredphrap. Thus, if you have an assembly in which you have made a force join or a force tear, it is best to add your data using consed and "Add New Reads" to avoid having to redo the force join (or tear) after adding your new data.

How do I identify the ends of my Drosophila grimshawi fosmid?

If you have an easy clone with a single large contig that is about as big as a fosmid (most fosmids have inserts between 35 and 42 kb, but a few fall outside this range), the ends are obviously at the ends of the contig. If you have two or more contigs, you will need to find the reads that mark the ends of the fosmid. Be aware that, given the mix of fosmid and 2 kb reads, the orientation of the contigs shown in assembly view could be wrong and should not be trusted. You must use read names to identify the ends that came from the fosmid clone and so confirm the orientation shown in assembly view.

How do I identify read names that came from the end of my fosmid?

For the D. grimshawi project there are two sources of fosmid end reads: the original Agencourt reads done as part of the whole genome assembly, and the reads done at the Genome Sequencing Center (GSC) to confirm the identification of the fosmid template. All the fosmid reads from the Agencourt assembly have a three-letter prefix (dga, dgb, dgc, etc.) to separate them from the bulk sequencing reads. These reads have a .b1 or .g1 extension added to the end of the read name to indicate which primer was used to generate the sequence, either forward (.b1) or reverse (.g1). For many projects there may also be one or two reads done by the GSC and incorporated into your project that mark the ends. These reads start with DGAA- instead of DG and have an extra "a" appended to the project name.

For example, for the project DGA06H06, the fosmid end reads from the original Agencourt assembly are named: dga06h06.b1 and dga06h06.g1. The additional fosmid end reads from the GSC are named: DGAA-A06H06a.b1 and DGAA-A06H06a.g1.
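The naming pattern in this example can be written out as a short Python sketch. The function name fosmid_end_read_names is hypothetical, and the rule is inferred only from the DGA06H06 example above, so treat the output as a list of names to search for rather than a guarantee of what your project contains.

```python
def fosmid_end_read_names(project):
    """Return the expected fosmid end read names for a project name.

    Assumption: the project name starts with "DG" and follows the same
    pattern as the DGA06H06 example (hypothetical helper).
    """
    name = project.upper()
    if not name.startswith("DG"):
        raise ValueError(f"unexpected project name: {project}")
    extensions = (".b1", ".g1")  # forward and reverse primer reads
    agencourt_stem = name.lower()            # e.g. dga06h06
    gsc_stem = "DGAA-" + name[2:] + "a"      # e.g. DGAA-A06H06a
    return {
        "agencourt": [agencourt_stem + ext for ext in extensions],
        "gsc": [gsc_stem + ext for ext in extensions],
    }


names = fosmid_end_read_names("DGA06H06")
print(names["agencourt"])  # ['dga06h06.b1', 'dga06h06.g1']
print(names["gsc"])        # ['DGAA-A06H06a.b1', 'DGAA-A06H06a.g1']
```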

How do I find a particular read in my assembly if I only know the read name?

The easiest way to find a read in your project is to use the search box in the Main Window of Consed. You can search for any read by entering your search term into the box labelled "Find reads containing (*'s allowed)". Simply type the search term and hit the return key. If the search term matches a single read, an Aligned Reads window will open with the cursor set at the first base of the read you were searching for. If the search term matches more than one read, a small text window will open listing all matching reads along with the contig number and base position of each. Click on the read you wish to view to open an Aligned Reads window of the proper contig with the cursor placed at the first base of that read.

How do I get consed to join vector and insert to create the proper digest pattern?

When looking at the digests you may have noticed a large number of mismatching bands. This is probably not due to a mis-assembly but instead is due to consed failing to link the insert and vector in silico prior to digestion. As a consequence, the in silico digest has two extra cut sites that are not really in the fosmid. To complicate matters further, some enzymes have sites very close to the vector/insert boundary, so the digests will look fine for some enzymes and not for others even though the added in silico cut sites are present in every case.
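A toy Python sketch may help show why an unjoined insert and vector produce mismatching bands. It is not how consed computes digests: the sequences are made up, EcoRI (GAATTC) is used only as an example site, and fragments are measured between the starts of recognition sites rather than at the true cut position. Digesting the insert and vector as two separate linear pieces behaves like adding a cut at each vector/insert boundary, so each boundary-spanning fragment is split in two.

```python
SITE = "GAATTC"  # EcoRI recognition sequence, used only as an example


def site_starts(seq, site, circular=False):
    """Return 0-based start positions of every occurrence of site."""
    # For a circle, extend the text so sites spanning the origin are found.
    text = (seq + seq[: len(site) - 1]) if circular else seq
    starts = []
    pos = text.find(site)
    while pos != -1:
        starts.append(pos)
        pos = text.find(site, pos + 1)
    return starts


def linear_fragments(seq, site):
    """Fragment sizes when seq is digested as a linear molecule."""
    bounds = [0] + site_starts(seq, site) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def circular_fragments(seq, site):
    """Fragment sizes when seq is digested as a closed circle."""
    cuts = site_starts(seq, site, circular=True)
    if not cuts:
        return [len(seq)]  # uncut circle
    n = len(seq)
    sizes = []
    for i, cut in enumerate(cuts):
        nxt = cuts[(i + 1) % len(cuts)]
        size = (nxt - cut) % n
        sizes.append(size if size else n)  # a single site gives one full-length band
    return sizes


# Made-up toy sequences: two sites in the insert, one in the vector.
insert = "AAAA" + SITE + "A" * 10 + SITE + "AAA"
vector = "CCC" + SITE + "CCCCC"

unjoined = sorted(linear_fragments(insert, SITE) + linear_fragments(vector, SITE))
joined = sorted(circular_fragments(insert + vector, SITE))
print(unjoined)  # [3, 4, 9, 11, 16]
print(joined)    # [12, 15, 16]
```

In this toy case the internal 16-base fragment appears in both digests, while the two boundary-spanning fragments (12 and 15) are each replaced by two smaller pieces when the insert and vector are not joined. When a site lies very close to a boundary, one of those pieces is tiny and the other is nearly the correct size, which is consistent with some enzymes looking fine.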

There is one thing you can do that has helped in many cases: consed does a better job of joining insert to vector if you select "entire single contig" instead of the default "Entire clone" in the top right of the "Select Enzymes and Contigs" window. After you select "entire single contig", enter the number of the contig that is your insert. Select your enzymes as you normally would and click OK.

You may see an error message that starts like this:

"can't determine which end of vector is connected to the right end of ...

This is good because it means consed is trying to join the insert with the vector but it cannot find enough sequence overlap to know which end of the insert goes with which end of the vector. So it is warning you that consed is just going to guess.

If consed guesses correctly, your map should be correct. If consed guesses incorrectly, you will have two fragments that do not match; each incorrect fragment will have one end in the insert and one end in the vector. If this happens, click "compl vector". This will flip the orientation of the vector relative to the insert, and then everything should match.