
Where the Xs are placed.

Posted: Mon Feb 20, 2012 5:38 pm
by dpaetkau
I have a number of students who are questioning their end sequences and where to put tags.
1. The new end reads they ordered have come back. There are now high quality discrepancies at the end, usually for 3 bases, because the X's in the new sequence start 3 bases further out than in the original traces.

2. I have the screenshot below from another project.
[Image: Screen shot 2012-02-20 at 3 36 41 PM.png]
Okay, I have something confused. I thought that since the fosmidendwg2420E14c.b1 read had the X's, and it was the fosmid end clone, the X's would be in the correct spot and I should X out everything else. However, when I BLAST the bases from 1-215, I get a very nice match to D. ananassae mRNA. This would suggest that the end read is not vector sequence but real D. ananassae sequence.
Okay, I just did one more check. When I look at the trace, I realize that it does not match the consensus at all. It matches where there are no X's, but not in the region where there are X's. I am obviously missing something. Can you please tell me what I am missing?

Re: Where the Xs are placed.

Posted: Mon Feb 20, 2012 9:46 pm
by wleung
Basically, in order to determine the subset of WGS reads that belong to the fosmid project, we identify the locations of the end sequences that correspond to the fosmid and extract all the reads that are placed within this range. Depending on the orientation of the sequencing reads, part of the read may map within this range and then extend beyond the end of the fosmid.

Because these reads are part of the D. ananassae whole genome shotgun (WGS) assembly, the consensus derived from these reads would match the D. ananassae assembly in a BLAST search and could correspond to a transcribed region.

Below is a diagram that illustrates the issue:
[Figure: Reads extend beyond fosmid ends (fosmid_end_reads_explanation.png)]
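The selection step described above can be sketched roughly as follows. This is only an illustration, not the actual GEP pipeline: the `Placement` record, the `select_fosmid_reads` function, and all coordinates are hypothetical, and positions are assumed to be 1-based and inclusive on the WGS assembly.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    """Hypothetical record of where a WGS read sits on the assembly."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def select_fosmid_reads(placements, fosmid_start, fosmid_end):
    """Keep reads that overlap the fosmid range and report how many
    bases of each read hang past the left and right fosmid ends."""
    selected = []
    for p in placements:
        # Skip reads that lie entirely outside the fosmid range.
        if p.end < fosmid_start or p.start > fosmid_end:
            continue
        left_overhang = max(0, fosmid_start - p.start)
        right_overhang = max(0, p.end - fosmid_end)
        selected.append((p.name, left_overhang, right_overhang))
    return selected


# Made-up placements; the fosmid ends are assumed to map to 1000 and 41000.
reads = [
    Placement("readA", 950, 1700),     # starts 50 bases before the fosmid
    Placement("readB", 1200, 1900),    # entirely inside the fosmid
    Placement("readC", 40500, 41210),  # runs 210 bases past the fosmid
    Placement("readD", 42000, 42700),  # entirely outside, not selected
]
result = select_fosmid_reads(reads, 1000, 41000)
```

In this sketch, readA and readC are still pulled into the project even though part of each read lies beyond a fosmid end, which is why real D. ananassae sequence can show up past the end of the fosmid.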

Re: Where the Xs are placed.

Posted: Mon Feb 20, 2012 11:07 pm
by dpaetkau
So, if I understand it correctly, the fosmid end read determines the placement of the vector sequence, and we should place the end of the contig based on the fosmid end read. Is it okay to X out all of the other reads at the same place? Or should you just place the end of the contig at that spot and leave the reads as they are?

Re: Where the Xs are placed.

Posted: Mon Feb 20, 2012 11:30 pm
by wleung
Yes, the fosmid end read determines the placement of the vector sequence, and you can change all the bases that extend beyond the end of the fosmid to X's and then re-run phredPhrap. The only exception is when there are multiple inconsistent forward/reverse mate pairs near the fosmid ends (which may indicate a misassembly).
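To make the idea concrete, here is a minimal sketch of X-ing out the bases of a read that fall beyond the fosmid ends. In practice this edit is made in Consed; the `mask_beyond_fosmid` function and the coordinates are hypothetical, and the sketch assumes the read aligns to the consensus without gaps.

```python
def mask_beyond_fosmid(read_bases, read_start, fosmid_start, fosmid_end):
    """Replace every base that falls outside [fosmid_start, fosmid_end]
    with 'X'. read_start is the 1-based consensus position of the first
    base of the read (ungapped alignment assumed)."""
    masked = []
    for offset, base in enumerate(read_bases):
        position = read_start + offset
        inside = fosmid_start <= position <= fosmid_end
        masked.append(base if inside else "X")
    return "".join(masked)


# A read starting 3 bases before an assumed fosmid start at position 1001.
masked_read = mask_beyond_fosmid("ACGTACGTAC", 998, 1001, 5000)
```

Here the first three bases (positions 998 to 1000) become X's and the rest of the read is left unchanged.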

Re: Where the Xs are placed.

Posted: Tue Feb 21, 2012 11:29 am
by dpaetkau
Thank you for your help, Wilson. One last question. While we have been waiting for reads, students have fixed the high quality discrepancies. If we rerun phredPhrap, won't we lose all of their work? I have been telling them to add reads when they come in and not to rerun phredPhrap. Is this incorrect? Specifically, I am asking whether the tags, the oligo tags, and the changes from high quality to low quality sequence remain when we rerun phredPhrap.

Re: Where the Xs are placed.

Posted: Tue Feb 21, 2012 1:39 pm
by wleung
In general, you will not lose any of your work (edits, tags, etc.) when you re-run phredPhrap. The only exceptions are tears and joins, where phrap will likely make the same mistake when you reassemble the project. As part of the phredPhrap script, the program will examine the most recent ace file and transfer any tags on the consensus to the new assembly (using the script transferConsensusTags.perl). Because the tag transfer process relies on the alignment of the old consensus with the new consensus, it may not work perfectly, and some of the tags may be lost in projects with major misassemblies.
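As a rough illustration of why alignment-based tag transfer can drop tags, the sketch below maps a position on an old consensus to a new one using Python's standard `difflib.SequenceMatcher`. This is not how transferConsensusTags.perl is implemented; the `map_position` function and the toy sequences are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def map_position(old_consensus, new_consensus, old_pos):
    """Return the 0-based index in new_consensus that aligns with
    old_pos in old_consensus, or None if that base falls in a region
    with no match (the tag would be lost)."""
    # autojunk=False stops difflib from ignoring frequent characters,
    # which matters for a four-letter alphabet like DNA.
    matcher = SequenceMatcher(None, old_consensus, new_consensus, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.a <= old_pos < block.a + block.size:
            return block.b + (old_pos - block.a)
    return None


# New assembly gained two bases at the start: positions shift by 2.
shifted = map_position("ACGTACGTTTGACCA", "GGACGTACGTTTGACCA", 5)

# New assembly lost the CCCC region: a tag placed there has no target.
lost = map_position("AAAACCCCGGGG", "AAAAGGGG", 5)
kept = map_position("AAAACCCCGGGG", "AAAAGGGG", 9)
```

When the old and new consensus differ only slightly, every tagged position finds a counterpart. When a region is rearranged or missing, as in a major misassembly, positions in that region have nothing to map to.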

For the individual reads, the edits and tags are stored in the phd files (inside the phd_dir directory). Any time you make an edit or add a tag to a trace, a new phd file is created with all the changes you have made to the trace. The ace file is actually a database that references specific versions of the phd files. When you reassemble a project with phredPhrap, it first determines whether a phd file for the trace is already present. If the phd file is available, it uses the most recent version of the phd file (which contains the tags and edits) when generating the new assembly. If not (i.e., for the new traces), the phredPhrap script will call phred to generate the corresponding phd file.
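The "most recent version" lookup can be pictured with a short sketch like the one below. It assumes phd files are named `<read name>.phd.<version>` with an increasing version number; the `latest_phd_names` function and the file names are hypothetical, and this is not the code phredPhrap actually runs.

```python
import re

# Assumed naming convention: <read name>.phd.<version number>
PHD_NAME = re.compile(r"^(?P<read>.+)\.phd\.(?P<version>\d+)$")


def latest_phd_names(file_names):
    """Map each read name to its highest-numbered phd file name."""
    latest = {}
    for name in file_names:
        match = PHD_NAME.match(name)
        if match is None:
            continue  # not a phd file
        read = match.group("read")
        version = int(match.group("version"))  # compare numerically
        if read not in latest or version > latest[read][0]:
            latest[read] = (version, name)
    return {read: name for read, (_, name) in latest.items()}


# Made-up directory listing: one read edited several times, one untouched.
listing = [
    "readA.b1.phd.1",
    "readA.b1.phd.2",
    "readA.b1.phd.10",
    "readB.g1.phd.1",
    "notes.txt",
]
newest = latest_phd_names(listing)
```

To try this on a real project, the list could come from `os.listdir("phd_dir")`. A trace with no entry in the result corresponds to a new read, for which phred would first have to generate a phd file.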

In general, unless your project has major misassemblies, I would recommend running phredPhrap instead of adding new reads. Please refer to the Sequence Improvement Questions page on the GEP wiki for more information.