High number HQDs - only one run matches consensu

Ask questions about sequence improvement / finishing D. mojavensis projects here.
Post Reply
dpaetkau
Posts: 29
Joined: Fri Jun 05, 2009 6:18 pm

High number HQDs - only one run matches consensu

Post by dpaetkau » Tue Sep 23, 2014 2:12 pm

Hello,

We are encountering a interesting phenomenon in some projects - a spot where every run except one disagrees with the consensus.
At first we thought that it was a Baylor mistake - the only run that has the different base is the one named for the project name, and so it just means the original consensus had the error - we find this a few times.
However, on closer inspection, it seems that in some, the Baylor agrees with the majority of runs and that it is a few 454 funs that agree with the consensus, but by far, the majority of both 454 and Illumina runs disagree with the consensus.
We have, of course, changed the consensus, but we are wondering why this occurs. Especially in the second case.
I would tell you the project name and send pictures, but those were not included in my student's email. We will be having a short lecture on how to email me problems today, so if you need that info, I will post it tonight.

Don
2. However

wleung
Posts: 182
Joined: Sun Feb 04, 2007 7:41 pm
Location: Washington University in St. Louis

Re: High number HQDs - only one run matches consensus

Post by wleung » Tue Sep 23, 2014 3:25 pm

We constructed the Consed assemblies by mapping the Illumina and 454 reads against the published consensus sequence. The protocol we used to create the ace file does not change the consensus because we would like the consensus for the Consed assembly to match the published assembly. This allows us to more easily document the changes that were made by the GEP compared to the published consensus in the version 2 assembly.

However, this means that the reads that aligned to the assembly does not necessarily correspond to the reads that were used to produce the consensus sequence. Particularly within repetitive regions, the 454 and Illumina reads could have been misplaced in the assembly. The mapping protocol has to tolerate some mismatches to account for base calling errors and indels. In addition, cross_match does not take the paired-end information into account when it aligns the 454 and Illumina reads against the consensus. This means that there is a higher rate of misplaced reads in the Consed assembly compared to the published whole genome shotgun assembly.

dpaetkau
Posts: 29
Joined: Fri Jun 05, 2009 6:18 pm

Re: High number HQDs - only one run matches consensu

Post by dpaetkau » Tue Sep 23, 2014 5:00 pm

Hello Wilson,

I completely understand the first paragraph and that makes sense that we want to easily identify changes.



It seems to me that your last line suggests that the consensus is more reliable (since it takes into account end reads, for example) than the Consed arrangement. My problem is a theoretical one. If almost all (we are talking about 95-99 out of 100 reads), are giving the same base, and only the consensus is giving a different base, how did they get the consensus in the first place. I know that there might be some mismapping of some reads, but both projects are using the same reads, so it seems a bit strange that all of the reads mapping to this area have a different base. So is it just a mistake in copying, or is there some other reason for what seems to be a glitch.

Don

wleung
Posts: 182
Joined: Sun Feb 04, 2007 7:41 pm
Location: Washington University in St. Louis

Re: High number HQDs - only one run matches consensu

Post by wleung » Tue Sep 23, 2014 6:32 pm

As you may know, Baylor uses a custom pipeline to create the version 2 of the D. biarmipes assembly. The output of the pipeline did not include the placement of the 454 and Illumina reads that were used to construct the consensus. As a first approximation, we mapped all the genomic reads against the published assembly and then extract the subset of reads that mapped to each project region to create the Consed projects.

Because the whole genome assembly and read mapping are two very different computational problems, it would not be surprising for the two approaches to produce different consensus sequences. In particular, because of the short read lengths, repetitive sequences are placed randomly in the assembly if there are multiple locations in the assembly that would yield the same optimal alignment score. Most assemblers generally ignore reads that mapped to a large number of contigs while the read mappers have higher tolerance for multi-mapped reads. In addition, assemblers do not guarantee that they will report the optimal assembly (because it is too computationally expensive). They also tend to propagate errors because it builds the contigs into larger scaffolds through an iterative process.

I should also note that the original assembly is based on 454 reads and the Illumina reads are used only for polishing. The polishing pipeline will only change the consensus based on a very stringent threshold. (The GATK pipeline is originally designed for identifying sequence variants). Hence if there are 454 reads that support the consensus at that position, it might not have reached the threshold required for the custom GATK pipeline to change the consensus sequence. In the absence of the read placement information, we would need to generate additional Sanger sequencing data to determine the correct base call.

dpaetkau
Posts: 29
Joined: Fri Jun 05, 2009 6:18 pm

Re: High number HQDs - only one run matches consensu

Post by dpaetkau » Thu Sep 25, 2014 2:29 pm

It seems to me that your explanation is also the reason that we are not focusing on HQDs in non-MNR regions. My students (and I will admit that I too am guilty), have got a bit caught up in these problems. They are both interesting and frustrating. What I hear you telling us in your last post is that there really isn't enough information to correct HQDs that are not in MNRs. Too many variables from the different way the projects were done.

So, ignore these problems and move on.

Is this a correct assessment of the problem. It means that are analysis of the sequence is very rudimentary (only MNRs unless we can get PCR to work) and we have to ignore a lot of things that look like errors.

Don

wleung
Posts: 182
Joined: Sun Feb 04, 2007 7:41 pm
Location: Washington University in St. Louis

Re: High number HQDs - only one run matches consensu

Post by wleung » Thu Sep 25, 2014 5:31 pm

> It seems to me that your explanation is also the reason that we are not focusing on HQDs in non-MNR regions.

Yes. Unfortunately, we do not have any external evidence (e.g., restriction digests, long high quality reads) that would convince a reviewer that the edits that we have made to the assemblies are correct. As you have mentioned, there are a large number of these potential consensus errors in each assembly and it would be too expensive and time consuming to use Sanger sequencing to verify each potential error. You could ask your students to add a comment tag to all of these potential errors if you prefer (i.e. this step is not required for project submission). Depending on how these errors affect the motif discovery algorithms, we may run additional Sanger reactions later to cover a subset of these problem regions that overlap with an annotated transcript.

The problem with mononucleotide runs (MNRs) in 454 assemblies is a special case because the error profile is well known in the field. In addition, because the indels within MNRs often disrupt the open reading frame of coding regions, this provides us with an additional line of evidence that we can use to support the hypothesis of an error in the consensus sequence.

Post Reply