"Unaligned high quality" must not mean what I think it means

Ask questions about sequence improvement / finishing D. mojavensis projects here.
cjones
Posts: 99
Joined: Sun Feb 04, 2007 10:19 pm
Location: Moravian College, Bethlehem PA

"Unaligned high quality" must not mean what I think it means

Post by cjones » Thu Feb 14, 2013 2:02 am

Why does Consed sometimes identify stretches as "unaligned high quality" when, upon inspection, the stretch is a mix of upper- and lower-case bases on a grey or black background, not at all my idea of "high quality"? I told the student dealing with this to simply pull that read (because many other bona fide high-quality reads cover the same region), but Consed insists on pulling out several dozen reads as a result. Why is this happening, and is there some way around it besides converting a whole lot of bases to Ns?
Chris Jones
Assoc. Prof. of Biology
Moravian College
Bethlehem PA

wleung
Posts: 182
Joined: Sun Feb 04, 2007 7:41 pm
Location: Washington University in St. Louis

Re: "Unaligned high quality" must not mean what I think it m

Post by wleung » Thu Feb 14, 2013 3:35 am

Because the spurious unaligned high quality reads do not usually affect the consensus, the best way to resolve the issue is either to change the low quality bases to Ns or to add a comment tag indicating that the unaligned region is low quality. However, if you see green matchElsewhereHighQual tags in the read, I would perform a Search for String using the tagged sequence to verify that the read and its mate pair have not been misplaced.

> I told the student dealing with this to simply pull that read ... Consed insists in pulling out several dozen reads as a result

When you pull a read out of the assembly, Consed will try to pull out other reads that have similar sequences. This is a convenient feature when dealing with misassemblies where you would need to pull out multiple discrepant reads. In most cases, the part of the read that is outside of the unaligned high quality region will have a high degree of sequence similarity with other reads and the consensus. Consequently, Consed will often pull out additional reads when you try to pull out the read with unaligned high quality regions.


> Why does consed sometimes identify stretches as "unaligned high quality" but, upon inspection, it's a stretch of mixed upper- and lower-case
> bases.

The minimum number of bases that constitutes an "unaligned high quality" region is controlled by the Consed parameter consed.ignoreUnalignedHighQualitySegmentsShorterThanThis. By default, a region must contain at least 20 bases to be classified as unaligned high quality. Note that, by default, the high quality threshold is 20 (Q20) in the context of the assembly but 40 in the context of high quality discrepancies. The Q20 threshold is also used in the read depth graph in Assembly View.

As in the read depth calculation, Consed uses the length of the maximal Q20 read segment to determine the length of the unaligned high quality region. The maximal read segment begins at the first base that is Q20 or above and ends at the last base that is Q20 or above (within the unaligned region). Only the two endpoints have to meet the threshold, so, as you noted, there can be bases within the read segment that are of much lower quality.
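To make the "first Q20 base to last Q20 base" rule concrete, here is a rough Python sketch of the calculation as described above. It is only an illustration, not Consed's actual code: the function names and the example quality values are made up, and the two constants simply mirror the defaults mentioned in this post.

```python
from typing import Optional, Sequence, Tuple

# Defaults as described in this post (assumed here, not read from Consed itself).
Q_THRESHOLD = 20         # "high quality" cutoff in the assembly context
MIN_SEGMENT_LENGTH = 20  # consed.ignoreUnalignedHighQualitySegmentsShorterThanThis


def maximal_segment(quals: Sequence[int],
                    threshold: int = Q_THRESHOLD) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of the maximal segment, or None.

    The segment runs from the first base at or above the threshold to the
    last base at or above the threshold; bases in between are not checked.
    """
    high = [i for i, q in enumerate(quals) if q >= threshold]
    if not high:
        return None
    return high[0], high[-1]


def is_unaligned_high_quality(quals: Sequence[int],
                              threshold: int = Q_THRESHOLD,
                              min_length: int = MIN_SEGMENT_LENGTH) -> bool:
    """True if the maximal segment is at least min_length bases long."""
    segment = maximal_segment(quals, threshold)
    if segment is None:
        return False
    first, last = segment
    return last - first + 1 >= min_length


# Hypothetical unaligned region: one good base, 20 poor bases, one good base.
example = [35] + [8] * 20 + [30]
print(maximal_segment(example))            # (0, 21)
print(is_unaligned_high_quality(example))  # True
```

In this made-up example only 2 of the 22 bases are Q20 or better, yet the segment spans all 22 bases and so clears the 20-base minimum. That is the situation described in the original question: a region that looks mostly low quality but is still reported as unaligned high quality.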

Please refer to section 12.92 in the current Consed documentation for a more detailed explanation of maximal read segment.
