good digest, (some) bad read pairs

Ask questions about sequence improvement / finishing D. mojavensis projects here.
Post Reply
cjones
Posts: 99
Joined: Sun Feb 04, 2007 10:19 pm
Location: Moravian College, Bethlehem PA
Contact:

good digest, (some) bad read pairs

Post by cjones » Thu Feb 07, 2013 2:38 am

A couple of projects have gone well, and the digests match very well for all four enzymes. However, there are a number of inverted red triangles indicating the corresponding mate pairs are too far apart (in this case). One possibility is that there is an uncollapsed repeat in there making the region artefactually large, but I don't think that all the reads in the region behave that way. How likely is it that multiple read pairs are genuinely farther apart than normal?
Chris Jones
Assoc. Prof. of Biology
Moravian College
Bethlehem PA

wleung
Posts: 185
Joined: Sun Feb 04, 2007 7:41 pm
Location: Washington University in St. Louis

Re: good digest, (some) bad read pairs

Post by wleung » Thu Feb 07, 2013 3:40 am

If all four restriction digests matched and there are no issues with the assembly in this region (e.g. high quality discrepancies, unaligned high quality, etc), then I would not be too concerned about the inconsistent mate pairs. In particular, if the inconsistent mate pairs are in the correct relative orientation (i.e. pointing toward each other) and are just slightly larger than the maximum insert size, then the mate pairs are probably placed correctly.

However, even though the digests are consistent (i.e. within the 2% size difference threshold between the real and in-silico bands), I would examine the fragments in the four restriction digests carefully to make sure that they are approximately the same size. Specifically, if the in-silico fragments in all four digests are consistently larger than the corresponding real fragments, then I would examine this region more closely to make sure that the region is assembled correctly.

You could also pull both mate pairs out of the main contig and search them against the whole genome assembly (using FlyBase BLAST or BLAT at the UCSC Genome Browser) to see if they might match better elsewhere in the assembly.

> How likely is it that multiple read pairs are genuinely farther apart than normal?
By default, Consed uses the distances between all the mate pairs in the entire project to create a distribution of the insert sizes. I believe the maximum insert size is defined as 2.5x of the standard deviation + the mean insert size. Assuming the distribution follows a normal distribution (i.e. z-score), this yields a probability of 0.994, which means that 99.4% of the insert sizes are expected to be below this threshold.

To be more precise, you can obtain the mean and standard deviation of the insert sizes for your project from the Consed main window (under Info -> Show Library Info). Then you can calculate the z-score (and p-value) for each of the inconsistent insert sizes you have observed. Multiply these probabilities to obtain the probability that you will observe the multiple inconsistent mate pairs.

cjones
Posts: 99
Joined: Sun Feb 04, 2007 10:19 pm
Location: Moravian College, Bethlehem PA
Contact:

Re: good digest, (some) bad read pairs

Post by cjones » Mon Feb 11, 2013 8:44 pm

Since you say to "pull both mate pairs out of the main contig and search them against the whole genome assembly" I assume there's no simple way to copy a read sequence to paste elsewhere? I don't see anything in Consed that looks like it, and neither chromat_dir nor phd_dir have anything in them that look workable either....
Chris Jones
Assoc. Prof. of Biology
Moravian College
Bethlehem PA

wleung
Posts: 185
Joined: Sun Feb 04, 2007 7:41 pm
Location: Washington University in St. Louis

Re: good digest, (some) bad read pairs

Post by wleung » Tue Feb 12, 2013 1:24 am

I do not believe there is an easier way to export the sequences of individual reads using Consed. However, in addition to pulling the reads out and exporting the consensus, there are two additional ways you can use to obtain the sequence of individual reads in a Consed database:

1. Use a trace viewer (e.g. 4Peaks on Mac OS X) to view and export the sequence

2. Use the phd2fasta script

As part of the Consed installation, there is a script in /usr/local/genome/bin called phd2fasta which will convert phd files into fasta and quality files.

Step 1: Create a new directory to store all the phd files of interests
Step 2: Copy the phd files from the phd_dir to the new directory
Step 3: Convert the phd files to fasta files using phd2fasta

You can use the graphical interface (e.g. Finder on the Mac) to accomplish steps 1 and 2.

Example: obtain fasta sequences for the phd files 25181711K20.b1.phd.1 and 25181711K20.g1.phd.2

Code: Select all

# Inside edit_dir of the project directory

# create new directory to store all the phd files of interests
mkdir check_traces

# copy phd files of interests to the new directory
cp ../phd_dir/25181711K20.b1.phd.1 ../phd_dir/25181711K20.g1.phd.2 ./check_traces/

# create fasta sequence files
phd2fasta -id ./check_traces/ -os 25181711K20_sequences.fasta
The command above will analyze all the phd files in the check_traces directory and put the sequences into a file called 25181711K20_sequences.fasta in the current directory.

Post Reply