Alignment and Proximity-Ligation QC

At step Removing PCR duplicates you used the flag –output-stats, generating a stats file in addition to the pairsam output (e.g. –output-stats stats.txt). The stats file is an extensive output of pairs statistics as calculated by pairtools, including total reads, total mapped, total dups, total pairs for each pair of chromosomes etc’. Although you can use directly the pairtools stats file as is to get informed on the quality of the TopoLink library, we find it easier to focus on a few key metrics. We include in this repository the script get_qc.py that summarize the paired-tools stats file and present them in percentage values in addition to absolute values.

The images below explains how the values on the QC report are calculated:

Command:

python3 ./TopoLink/get_qc.py -p <stats.txt>

Example:

python3 ./TopoLink/get_qc.py -p stats.txt

After the script completes, it will print:

Total Read Pairs                              10,000,000     100%
Unmapped Read Pairs                           325,358       3.25%
Mapped Read Pairs                             8,129,270    81.29%
PCR Dup Read Pairs                            218,657       2.82%
No-Dup Read Pairs                             7,847,613    78.48%
No-Dup Cis Read Pairs                         5,616,943    71.58%
No-Dup Trans Read Pairs                       2,230,670    28.42%
No-Dup Valid Read Pairs (cis >= 1kb + trans)  5,929,881    75.56%
No-Dup Cis Read Pairs < 1kb                   1,917,732    24.44%
No-Dup Cis Read Pairs >= 1kb                  3,699,211    47.14%
No-Dup Cis Read Pairs >= 10kb                 3,221,239    41.05%

Complexity

If you performed a shallow sequencing experiment (e.g. 10M reads) and running a QC analysis to decide which library to use for deep sequencing (DS), it is recommended to evaluate the complexity of the library before moving to DS.

The lc_extrap utility of the preseq package aims to predict the complexity of sequencing libraries.

preseq options:

Parameter	Value	Function
bam		Specifies that the input file type is bam. Please note that for a bam file to be a recognized input file htslib sould be installed as well and preseq should be built with htslib support (for more details see preseq documentation or our installDep.sh script as example)
pe		Specifies that paired end data is used
extrap	2.10E+09	Maximum extrapolation
step	1.00E+08	Extrapolation step size
seg_len	1000000000	maximum segment length when merging paired end bam
output		output file

Please note that the input bam file should be a version prior to dups removal.

preseq lc_extrap command example for extrapolating library complexity:

Command:

preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output <output file> <input bam file>

Example:

preseq lc_extrap -bam -pe -extrap 2.1e9 -step 1e8 -seg_len 1000000000 -output out.preseq mapped.PT.bam

In this example the output file out.preseq will detail the extrapolated complexity curve of your library, with the number of reads in the first column and the expected distinct read value in the second column. For a typical experiment (human sample) check the expected complexity at 400M reads (to show the content of the file, type cat out.preseq). Expected unique pairs at 400M sequencing is at least ~ 125 million

QC Assessment

Pass/No Pass Metrics
- No-Dup Cis Read Pairs >= 1kb – This value demonstrates that the proximity-ligation step was successful, and the majority of the data are useful in downstream analyses (e.g. loop calling).
- For Shallow QC Sequencing Complexity at 400M Read Pairs – This value informs how many unique reads a library can support.
- For Deep - sequencing No-Dup Read Pairs
Pass/No Pass Values
- The table below summarizes the minimum passing values for the metrics defined above. The cut-off values were determined for both shallow sequenced (10 million read pairs 2 x 150 bp) and deep sequenced data (200-300 Million read pairs 2 x 150 bp).

Metric	Shallow Sequencing	Deep Sequencing
No-Dup Cis Read Pairs >= 1kb	>40% of no-dup read pairs	>40% of no-dup read pairs
Complexity @ 400M Read Pairs	>125 million	NA
No-Dup Read Pairs	NA	>125 million

Sequencing Recommendations

TopoLink was designed to support looping calling with one sample. This requires generating four libraries from a single proximity-ligation reaction. This does not mean you need to sequence all four libraries. The amount of sequencing and the number of libraries you need to to sequence is dependent on the feature you are trying to detect and the resolution (or bin size) you wish to call features at. The table below outlines the number of libraries, total sequencing depth in read pairs, and how many read pairs are needed per library, and finally the minimal amount of no-dup read pairs summed across the libraries for each feature at given resolutions:

Feature	Resolution	Total # libraries	Total # read pairs	Total # read pairs per library	Minimal # of no-dup read pairs summed across libraries
A/B Compartments	50-100 kb	1	200 Million	200 Million	>80 Million
TADS	25 kb	2	400 Million	200 Million	>150 Million
	10 kb	2	600 Million	300 Million	>300 Million
	5 kb	4	800 Million	200 Million	>400 Million
Loops	10 kb	4	800 Million	200 Million	>400 Million
	5 kb	4	1200 Million	300 Million	>500 Million

To generate the most complete matrix you can from a single 500 thousand cell input, you need sequence 4 libraries to a total of 1200 million read pairs (300 million per library).