Alignment and Proximity-Ligation QC

At step Removing PCR duplicates you used the flag –output-stats, generating a stats file in addition to the pairsam output (e.g. –output-stats stats.txt). The stats file is an extensive output of pairs statistics as calculated by pairtools, including total reads, total mapped, total dups, total pairs for each pair of chromosomes etc’. Although you can use directly the pairtools stats file as is to get informed on the quality of the TopoLink library, we find it easier to focus on a few key metrics. We include in this repository the script get_qc.py that summarize the paired-tools stats file and present them in percentage values in addition to absolute values.

The images below explains how the values on the QC report are calculated:

_images/1.Aligning.png _images/2.Cis_trans_valids.png

Command:

python3 ./TopoLink/get_qc.py -p <stats.txt>

Example:

python3 ./TopoLink/get_qc.py -p stats.txt

After the script completes, it will print:

Total Read Pairs                              10,000,000     100%
Unmapped Read Pairs                           325,358       3.25%
Mapped Read Pairs                             8,129,270    81.29%
PCR Dup Read Pairs                            218,657       2.82%
No-Dup Read Pairs                             7,847,613    78.48%
No-Dup Cis Read Pairs                         5,616,943    71.58%
No-Dup Trans Read Pairs                       2,230,670    28.42%
No-Dup Valid Read Pairs (cis >= 1kb + trans)  5,929,881    75.56%
No-Dup Cis Read Pairs < 1kb                   1,917,732    24.44%
No-Dup Cis Read Pairs >= 1kb                  3,699,211    47.14%
No-Dup Cis Read Pairs >= 10kb                 3,221,239    41.05%

Complexity

If you performed a shallow sequencing experiment (e.g. 10M reads) and running a QC analysis to decide which library to use for deep sequencing (DS), it is recommended to evaluate the complexity of the library before moving to DS.

The lc_extrap utility of the preseq package aims to predict the complexity of sequencing libraries.

preseq options:

Parameter

Value

Function

bam

Specifies that the input file type is bam. Please note that for a bam file to be a recognized input file htslib sould be installed as well and preseq should be built with htslib support (for more details see preseq documentation or our installDep.sh script as example)

pe

Specifies that paired end data is used

extrap

2.10E+09

Maximum extrapolation

step

1.00E+08

Extrapolation step size

seg_len

1000000000

maximum segment length when merging paired end bam

output

output file

Please note that the input bam file should be a version prior to dups removal.

preseq lc_extrap command example for extrapolating library complexity:

Command:

preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output <output file> <input bam file>

Example:

preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output out.preseq mapped.PT.bam

In this example the output file out.preseq will detail the extrapolated complexity curve of your library, with the number of reads in the first column and the expected distinct read value in the second column. For a typical experiment (human sample) check the expected complexity at 400M reads (to show the content of the file, type cat out.preseq). Expected unique pairs at 400M sequencing is at least ~ 125 million

_images/3.Complexity.png

QC Assessment

  • Pass/No Pass Metrics

    • No-Dup Cis Read Pairs >= 1kb – This value demonstrates that the proximity-ligation step was successful, and the majority of the data are useful in downstream analyses (e.g. loop calling).

    • For Shallow QC Sequencing Complexity at 400M Read Pairs – This value informs how many unique reads a library can support.

    • For Deep - sequencing No-Dup Read Pairs

  • Pass/No Pass Values

    • The table below summarizes the minimum passing values for the metrics defined above. The cut-off values were determined for both shallow sequenced (10 million read pairs 2 x 150 bp) and deep sequenced data (200-300 Million read pairs 2 x 150 bp).

Metric

Shallow Sequencing

Deep Sequencing

No-Dup Cis Read Pairs >= 1kb

>40% of no-dup read pairs

>40% of no-dup read pairs

Complexity @ 400M Read Pairs

>125 million

NA

No-Dup Read Pairs

NA

>125 million

Sequencing Recommendations

TopoLink was designed to support looping calling with one sample. This requires generating four libraries from a single proximity-ligation reaction. This does not mean you need to sequence all four libraries. The amount of sequencing and the number of libraries you need to to sequence is dependent on the feature you are trying to detect and the resolution (or bin size) you wish to call features at. The table below outlines the number of libraries, total sequencing depth in read pairs, and how many read pairs are needed per library, and finally the minimal amount of no-dup read pairs summed across the libraries for each feature at given resolutions:

Feature

Resolution

Total # libraries

Total # read pairs

Total # read pairs per library

Minimal # of no-dup read pairs summed across libraries

A/B Compartments

50-100 kb

1

200 Million

200 Million

>80 Million

TADS

25 kb

2

400 Million

200 Million

>150 Million

10 kb

2

600 Million

300 Million

>300 Million

5 kb

4

800 Million

200 Million

>400 Million

Loops

10 kb

4

800 Million

200 Million

>400 Million

5 kb

4

1200 Million

300 Million

>500 Million

To generate the most complete matrix you can from a single 500 thousand cell input, you need sequence 4 libraries to a total of 1200 million read pairs (300 million per library).