
What is calc_total_fragSizes.pl doing?

Posted: Thu Feb 07, 2013 12:39 am
by cjones
Running the calc_total_fragSizes.pl script on the project I'm looking at returns a *wide* range of sizes:

#Enzyme Total_size
EcoRI 83126
EcoRV 48115
HindIII 41444
SacI 49103

yet the total size of the fragments summed up from Consed's digest panels is far less: 38832 nt. What exactly is the script doing here? I knew going in that there are misassemblies galore in this project, but I can't interpret this result.

Re: What is calc_total_fragSizes.pl doing?

Posted: Thu Feb 07, 2013 1:47 am
by wleung
The calc_total_fragSizes.pl script parses the file which contains the observed fragment sizes in the real digests (i.e. fragSizes.txt) and calculates the sum of the fragment sizes for each of the real digests. We expect the real digest sizes to be approximately the same for all four restriction digests (where the total size = vector size + insert size). Note that the vector pcc01.fasta has a total size of 8,139bp.
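The per-digest summation described above can be sketched in Python. This is not the actual script, only an illustration of the idea. It assumes a hypothetical layout in which each line holds an enzyme name followed by whitespace-separated fragment sizes; the real fragSizes.txt format may differ. The function name and the sample numbers are made up.

```python
def total_frag_sizes(lines):
    """Sum the observed fragment sizes for each enzyme.

    Assumed (hypothetical) line format: "<enzyme> <size> <size> ...".
    Blank lines and lines starting with '#' are skipped.
    """
    totals = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        enzyme, sizes = fields[0], fields[1:]
        totals[enzyme] = totals.get(enzyme, 0) + sum(int(s) for s in sizes)
    return totals


# Made-up sample data, not taken from a real digest.
sample = [
    "#Enzyme fragment sizes",
    "EcoRI 12000 8000 5000",
    "HindIII 15000 10000",
]

print("#Enzyme Total_size")
for enzyme, total in total_frag_sizes(sample).items():
    print(enzyme, total)
```

If the totals for the different enzymes disagree substantially, that points to extra or missing bands in one of the real digests.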

By comparing the total sizes of the digests, we can identify extra or missing bands in the real restriction digests. For example, in the output you have provided, we can infer the following based on the total digest sizes:

1. According to the EcoRV and SacI digests, the total size of the insert is ~41kb:
Total fragment size of SacI - size of vector: (49103 - 8139 + 1) = 40965

2. The real EcoRI digest likely contains large bands that are spurious. The total amount of extra data in the real digest is ~34kb:
83126 - 49103 + 1 = 34,024

3. The real HindIII digest is missing ~7kb (often caused by uncalled doublets):
49103-41444 + 1 = 7660
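The three comparisons above can be reproduced with a short Python sketch. It uses the totals from the original post and the vector size given above, takes SacI as the reference digest (the same choice made in the list), and keeps the "+ 1" from the arithmetic as an `offset` parameter. The function names are hypothetical and are not part of calc_total_fragSizes.pl.

```python
VECTOR_SIZE = 8139  # size of pcc01.fasta in bp, as stated above

# Totals reported by calc_total_fragSizes.pl in the original post.
totals = {"EcoRI": 83126, "EcoRV": 48115, "HindIII": 41444, "SacI": 49103}


def insert_size(total, vector_size, offset=1):
    """Estimate the insert size from a digest total (total - vector + offset)."""
    return total - vector_size + offset


def discrepancy(total, reference_total, offset=1):
    """Signed difference from the reference digest.

    Positive means extra data (possible spurious bands),
    negative means missing data (possible uncalled doublets).
    """
    diff = total - reference_total
    if diff == 0:
        return 0
    magnitude = abs(diff) + offset
    return magnitude if diff > 0 else -magnitude


reference = totals["SacI"]
print("Insert estimate:", insert_size(reference, VECTOR_SIZE))
for enzyme, total in totals.items():
    print(enzyme, discrepancy(total, reference))
```

The sign of each result shows whether a digest has extra or missing data relative to the reference. The magnitudes are only as reliable as the choice of reference digest.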

Analysis of the output of the calc_total_fragSizes.pl script allows us to ascertain whether the discrepancies between the real and in-silico digests are genuine. In the example above, we know that there is likely a discrepancy of ~7kb between the real and the in-silico HindIII digests that could be attributed to uncalled genuine bands in the real digest.

When dealing with projects with major misassemblies, we can use the real digests to determine if there is a substantial amount of data missing from the project. If all four real restriction digests have approximately the same total size and they all indicate extra data compared to the in-silico digests, then the project is likely missing data. In that case, we would need to retrieve additional reads from the NCBI Trace Archive by searching for unpaired reads or by performing blastn searches against the D. ananassae WGS database at the Trace Archive.
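The missing-data rule above can be written as a simple check. This is a minimal sketch: the 5% tolerance for "approximately the same total size" is an arbitrary assumption, the function name is hypothetical, and it assumes the real and in-silico totals are measured on the same basis (both with or both without the vector). The second set of totals is made up to show a case where the rule fires.

```python
def likely_missing_data(real_totals, in_silico_total, tolerance=0.05):
    """Return True if all real digests agree with each other and all exceed the in-silico total.

    tolerance is the allowed spread (max - min) as a fraction of the mean;
    the 5% default is an assumption, not a value from the thread.
    """
    values = list(real_totals.values())
    mean = sum(values) / len(values)
    consistent = (max(values) - min(values)) / mean <= tolerance
    all_larger = all(v > in_silico_total for v in values)
    return consistent and all_larger


IN_SILICO_TOTAL = 38832  # summed from Consed's digest panels in the original post

# Totals from the original post: the digests disagree with each other.
thread_totals = {"EcoRI": 83126, "EcoRV": 48115, "HindIII": 41444, "SacI": 49103}

# Made-up totals in which all four digests roughly agree.
hypothetical_totals = {"EcoRI": 49500, "EcoRV": 48115, "HindIII": 48800, "SacI": 49103}

print(likely_missing_data(thread_totals, IN_SILICO_TOTAL))
print(likely_missing_data(hypothetical_totals, IN_SILICO_TOTAL))
```

For the totals in this thread the check fails on consistency, so the spurious EcoRI bands and the missing HindIII data would need to be resolved before drawing a conclusion about missing reads.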