Posted: Thu Mar 05, 2015 10:45 pm
In our D. elegans projects, we have many cases where the 454 reads are almost entirely one sequence and the Illumina reads are almost entirely another. In most cases there are 10 or 20 differences in a stretch of 100 or so bases, while in a couple it's a stretch of 30-100 bases that are completely different (which looks ugly, with lots of spaces). Although a couple of these are in repeats and are therefore likely misplaced reads, others are not in repeats. I assume that we should just tag the ones not in repeats as polymorphisms, since the difference is related to the sequencer/library construction and we have no way to know which is "correct"?
Posted: Fri Mar 06, 2015 1:06 am
As you have mentioned, unaligned high quality regions frequently appear within genomic repeats in hybrid assembly projects. In some cases, you can select "Tell phrap to not overlap reads at this location" at the discrepant positions and then perform a Miniassembly to create two separate contigs that are each consistent with their respective consensus sequence. You can then use "Search for String" or cross_match to see if you can place one of these contigs at a different location in the assembly. If the region cannot be resolved given the available data, I would add a "dataNeeded" tag to the consensus sequence in the unaligned high quality region.
> ... , others are not in repeats
The blue repeat tags are based on sequence similarity to a de novo species-specific transposon library produced by RepeatModeler. RepeatModeler identifies repeats based on over-represented sequences and k-mers, so the library does not contain all genomic repeats. You can run RepeatMasker on the genomic regions without the "repeat" tags to determine if the region has sequence similarity to transposons in other Drosophila