Modelling DNA-Encoded Libraries (DELs) - Part 1

Code

import warnings
warnings.filterwarnings("ignore", module="plotnine")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotnine as gg

What Is the Aim of This Blog?

This three-part series develops a probabilistic framework for DNA-encoded library (DEL) data. The goal is to explain noisy sequencing counts in terms of latent compound-level and synthon-level effects so that true enrichment can be separated from technical and compositional artefacts. Part 1 focuses on likelihood choice, data-generation assumptions, and the modelling implications of background effects, library composition, and replicate variation.

What Part 1 Does and Does Not Cover

Part 1 covers DEL likelihoods, modelling assumptions, exploratory data analysis, and the implications these have for model design.
Part 2 will introduce a probabilistic denoising count model.
Part 3 will extend that model to a synthon-level, attachment-aware setting.

Introduction: What Are DNA-Encoded Libraries (DELs)?

DNA-encoded libraries (DELs) are large collections of small molecules, each tagged with a unique DNA barcode that records its synthetic history. They enable high-throughput screening by allowing compounds to be pooled, selected, and then identified by sequencing. For a quick introduction, these two short videos give a useful overview: DNA-Encoded Libraries - A New Paradigm in Drug Design? and Learn DNA-Encoded Library technology for early drug discovery.

DEL Likelihood Functions

What is the right likelihood for DEL count data?

A DEL experiment involves several key stochastic steps that directly impact the statistical modelling of observed counts:

Library synthesis: initial molecule counts for each barcode can vary, introducing variability from the outset.
Selection or binding step: enriches certain barcodes based on their binding affinity, leading to barcode-specific changes in abundance.
PCR amplification: contributes substantial multiplicative noise and is a major source of overdispersion in the data.
Sequencing: captures a random, approximately Poisson-distributed sample from the amplified pool.

Thus, the observed DEL counts for each barcode are the outcome of a sequence of random processes, with amplification especially contributing to count heterogeneity.

Two data-generation perspectives for DEL counts

It is useful to separate DEL count models by the data-generation assumption they make about sequencing depth.

1) Independent-barcode (rate-based) generation

Each barcode (compound) produces its own count independently according to a barcode-specific rate.
The total read depth is then the sum of independent counts.
This perspective naturally motivates Poisson, Negative Binomial, Poisson-Lognormal, and their zero-inflated variants.

Figure 1: Independent-barcode generation: each barcode generates counts independently; totals add up.

2) Fixed-depth (composition-based) generation

A sample has a fixed total read depth N, and each read is assigned to exactly one barcode.
Barcodes therefore share a single pool of reads, inducing weak negative dependence between barcode counts.
This perspective naturally motivates Multinomial and Dirichlet-Multinomial, and the one-barcode marginal Binomial/Beta-Binomial.

Figure 2: Fixed-depth generation: a fixed total N is split across barcodes that share one pool.

Common likelihood functions for DELs

Below are common likelihood choices for DEL count data. In practice, DEL data are often overdispersed (variance > mean) due to PCR/amplification and other multiplicative factors. They may also exhibit excess zeros because of true absence, synthesis failures, or aggressive filtering.

Poisson

\[ X_i \sim \text{Poisson}(\lambda_i) \]

This likelihood belongs to the independent-barcode (rate-based) data-generation view, where each barcode produces counts according to its own latent rate.

Because sequencing counts in DEL experiments arise from random sampling of amplified DNA molecules, the sequencing step is often approximated using a Poisson distribution.

References:

When is Poisson appropriate?

Use Poisson when counts are well-explained by a single rate and the mean–variance relationship is close to \(\mathrm{Var}(X) \approx \mathbb{E}[X]\) after accounting for obvious covariates (e.g., sequencing depth / library size).

Notice that Poisson is usually too narrow for DEL data once PCR noise is present.

Poisson-Gamma or Negative Binomial (NB)

\[ X_i \mid \lambda_i \sim \text{Poisson}(\lambda_i), \qquad \lambda_i \sim \text{Gamma}(\alpha, \beta) \]

\[ X_i \sim \text{NB}(\mu_i, \phi) \]

This likelihood also keeps the independent-barcode (rate-based) view but allows extra variability beyond Poisson.

Because DEL sequencing readouts often exhibit overdispersion due to amplification and selection variability, barcode counts can be modelled using a Negative Binomial distribution.

References:

When is Negative Binomial appropriate?

Use NB when you see variance increasing faster than the mean, typically well-captured by \(\mu + \phi\mu^2\). This is often the default for DEL counts because amplification introduces extra variability.

If you model paired pre-/post-selection counts, NB is often used for each condition with shared/related dispersion:

\[ X_i^\text{pre} \sim \text{NB}(\mu_i^\text{pre}, \phi) \]

\[ X_i^\text{post} \sim \text{NB}(\mu_i^\text{post}, \phi) \]

The dispersion parameter \(\phi\) represents technical variability, not biological variability.

Because the PCR workflow, sequencing platform, and library chemistry are broadly shared across pre- and post-selection samples, assuming similar dispersion is often reasonable as a first approximation.

\[ \phi^\text{pre} \approx \phi^\text{post} = \phi \]

Or, use a common prior for the dispersion parameter:

\[ \phi^\text{pre}, \phi^\text{post} \sim \text{common prior} \]

Zero-Inflated Models

\[ X_i \sim \text{ZIP}(\lambda_i, \pi_i) \]

\[ X_i \sim \text{ZINB}(\mu_i, \phi, \pi_i) \]

Zero-inflated Poisson and Zero-inflated Negative Binomial are extensions of the independent-barcode (rate-based) view, adding an extra mechanism for structural zeros on top of barcode-specific count generation.

These models assume that zeros can arise from two distinct mechanisms:

\[ X_i = 0 \quad \text{with probability } \pi_i \]

and

\[ X_i \sim \text{Poisson}(\lambda_i) \quad \text{with probability } 1 - \pi_i \quad \text{(for ZIP)} \]

\[ X_i \sim \text{NB}(\mu_i, \phi) \quad \text{with probability } 1 - \pi_i \quad \text{(for ZINB)} \]

In ZIP/ZINB, with probability \(\pi_i\) an observation is a structural zero (e.g., true absence or synthesis failure), and otherwise counts follow Poisson or Negative Binomial. In DEL data, structural absence is rarely observed directly, so zero inflation often reflects dropouts or technical noise; it can improve fit, but overdispersed Negative Binomial models are often sufficient, so zero inflation should be a diagnostic-driven choice, not a default.

References:

DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries - Check the reviewers comments.
Compositional Deep Probabilistic Models of DNA-Encoded Libraries - ZIP used to model DEL data.

When are Zero-inflated models appropriate?

Zero-inflated models assume that zeros can arise through two mechanisms:

Structural zeros: these are zeros caused by:
- Synthesis failure for certain barcodes
- Missing synthons or ligation errors for certain barcodes
- Amplification failure causing barcode dropout
- Compound absence in the selection pool
Observed zeros: these are zeros caused by sampling, where the molecule was simply not sequenced.

Structural zeros are conceptually different from sampling zeros, which Poisson or Negative Binomial can already explain. This is important because many apparent extra zeros are already explained by overdispersion in the Negative Binomial model, not a separate zero-generation process.

Zero-inflated models are appropriate when:

Poisson or Negative Binomial alone underpredicts zeros in posterior predictive checks

The diagnostic is to check if observed zero fraction is larger than simulated zero fraction from Poisson or Negative Binomial
Known synthesis failure mechanisms

For instance:
- Specific synthons systematically fail
- Library construction known to have dropout
- Certain cycles produce missing compounds
These are structural zeros that are not explained by overdispersion.
Extremely sparse libraries

If most barcodes are absent or filtered out, then ZIP and ZINB are a better fit than Poisson or Negative Binomial.

Notice that:

If sparsity is already explained by low counts, then ZIP and ZINB are not necessary.
If filtering is the main source of zeros, then ZIP and ZINB are not necessary.

If zeros are mainly caused by thresholding / filtering (i.e., you only record nonzeros above a cutoff), a hurdle model may be a better conceptual fit than ZIP, ZINB.

Poisson-Lognormal

In the Negative Binomial model, the rate parameter is assumed to vary according to a gamma distribution. In contrast, the Poisson-Lognormal model assumes that the logarithm of the rate parameter follows a normal (lognormal) distribution.

\[ X_i \mid \lambda_i \sim \text{Poisson}(\lambda_i), \qquad \lambda_i \sim \text{Lognormal}(\mu_i, \sigma^2) \]

Like Poisson and Negative Binomial, this likelihood belongs to the independent-barcode (rate-based) data-generation view, but uses a lognormal distribution to model variation in barcode-specific rates.

A lognormal model for the rate has heavier tails than a gamma model, so it tends to fit DEL data better when a small number of barcodes show very large enrichment.

References:

There is no well-known DEL paper that explicitly uses a Poisson-Lognormal likelihood, although it is a statistically valid option for DEL data.

When is Poisson-Lognormal appropriate?

Use Poisson-Lognormal when noise is naturally multiplicative on the rate (PCR / assay effects), giving a log-normal-ish spread in effective rates across observations.

Multinomial / Dirichlet-Multinomial

If we model counts conditional on a fixed total read depth (library size) in a sample:

\[ \mathbf{X} \sim \text{Multinomial}(N, \mathbf{p}) \]

This likelihood belongs to the fixed-depth (composition-based) data-generation view. Here, \(N\) is the total read depth \((N = \sum_{i=1}^K X_i)\), where \(K\) is the number of barcodes, and \(\mathbf{p}\) is the vector of barcode proportions \((p_i = X_i / N)\).

An overdispersed extension is the Dirichlet-Multinomial.

\[ \mathbf{X} \mid \mathbf{p} \sim \text{Multinomial}(N, \mathbf{p}), \qquad \mathbf{p} \sim \text{Dirichlet}(\alpha) \]

References:

There is no well-known DEL paper that explicitly uses a Multinomial / Dirichlet-Multinomial likelihood. Check When are Multinomial / Dirichlet-Multinomial appropriate? for more details.

When are Multinomial / Dirichlet-Multinomial appropriate?

Use these when primary variability is driven by relative composition under a fixed (or modelled) total depth. This is common when comparing barcodes within the same sequencing run.

In practice:

Multinomial can be too narrow; Dirichlet-Multinomial adds extra variability in proportions.
Multinomial / Dirichlet-Multinomial are especially natural for modelling within-sample composition, whereas Poisson and Negative Binomial are common for per-barcode marginal modelling.
- Within-sample perspective: Given one sample with total reads \(N\), how are those reads split across all barcodes?
  - Model the joint vector of proportions/counts in that sample using a Multinomial distribution.
- Per-barcode perspective: For barcode \(i\), what count do I expect?
  - Model each barcode’s count distribution individually (often approximately independent, especially in sparse DEL settings).
Most DEL models do not use Multinomial / Dirichlet-Multinomial likelihoods, mainly for practical reasons:
1. Libraries are extremely large and have millions of barcodes. A multinomial likelihood becomes computationally difficult.
2. When counts are small relative to \(N\), then \(\text{Multinomial} \approx \text{independent Poisson}\).

Intuition:

Multinomial perspective: Imagine you have a total of \(N\) sequencing reads, and you assign each one to a barcode. The barcodes share the total pool of reads.
Negative Binomial/Poisson perspective: Instead, think of each barcode as producing its own number of reads independently, according to its own rate. The total number of reads is simply the sum of these independent counts.

Why does \(\text{Multinomial} \approx\) independent Poisson when counts are small relative to \(N\)?

Let \(\mathbf{X}=(X_1,\dots,X_K) \sim \text{Multinomial}(N,\mathbf{p})\), where \(\sum_{i=1}^K p_i = 1\) and \(\sum_{i=1}^K X_i = N\).

For a fixed barcode \(i\), the multinomial marginal is binomial:

\[ X_i \sim \text{Binomial}(N,p_i), \qquad \mathbb{E}[X_i]=Np_i \]

So \(Np_i\) is the expected read count for barcode \(i\): among \(N\) reads, each read lands in barcode \(i\) with probability \(p_i\).

In the DEL rare-event / large-depth regime, we consider:

\(N\) is large,
Each barcode probability \(p_i\) is small,
\(Np_i\) stays finite (typically on the order of 1).

Under this scaling (\(N\to\infty\), \(p_i\to 0\), \(Np_i\to\text{fixed value}\)), each marginal converges as \[ X_i \Rightarrow \text{Poisson}(\lambda_i), \quad \lambda_i = Np_i. \]

The derivation of the binomial-to-Poisson limit and the reason the dependence becomes negligible are discussed in the two boxes below.

So in the rare-event / large-depth regime, the multinomial can be viewed as the product of independent Poisson counts conditioned on a fixed total:

\[ \mathbf{X} \approx \prod_{i=1}^K \text{Poisson}(\lambda_i), \qquad \lambda_i=Np_i. \]

Why does the binomial distribution converge to Poisson in this regime?

The probability mass function (PMF) of the Binomial distribution is: \[ \Pr(X_i=k)=\binom{N}{k}p_i^k(1-p_i)^{N-k} \]

Where \(N\) is the total number of trials (or our experiment’s read depth), \(k\) is the number of successes (or our read count for a given barcode), and \(p_i\) is the probability of success on each trial (or our barcode enrichment probability).

Recall the conditions of the rare-event / large-depth regime:

\(N\) is large,
Each barcode probability \(p_i\) is small,
\(Np_i\) stays finite (typically on the order of 1).

Under this scaling, and assuming that Poisson’s parameter \(\lambda_i\) is equal to \(Np_i\), the PMF of the binomial distribution converges to the PMF of a Poisson distribution.

Start by rewriting \(p_i\) as: \[ p_i = \frac{\lambda_i}{N}. \]

Then the PMF of the binomial distribution becomes:

\[ \Pr(X_i=k)=\binom{N}{k}\left(\frac{\lambda_i}{N}\right)^k\left(1-\frac{\lambda_i}{N}\right)^{N-k}. \]

Recall that: \[\binom{N}{k} = \frac{N!}{k!(N-k)!}\]

and

\[N! = N(N-1)(N-2)(N-3)\cdots 3\times 2\times 1.\]

It can be written as: \[N! = N(N-1)\cdots(N-k+1)(N-k)!\]

This notation is useful because it shows that in \(\binom{N}{k} = \frac{N!}{k!(N-k)!}\), the \((N-k)!\) terms in the numerator and denominator cancel each other out. Hence, we can rewrite the binomial coefficient as:

\[ \binom{N}{k} = \frac{N(N-1)\cdots(N-k+1)}{k!}. \]

Therefore, the PMF of the binomial distribution becomes: \[ \Pr(X_i=k)=\binom{N}{k}\left(\frac{\lambda_i}{N}\right)^k\left(1-\frac{\lambda_i}{N}\right)^{N-k} = \frac{N(N-1)\cdots(N-k+1)}{k!}\frac{\lambda_i^k}{N^k}\left(1-\frac{\lambda_i}{N}\right)^{N-k}. \]

Focus first on the first two terms:

\[ \frac{N(N-1)\cdots(N-k+1)}{k!}\frac{\lambda_i^k}{N^k} \]

This can be written as:

\[ \frac{\lambda_i^k}{k!}\cdot \frac{N}{N}\cdot\frac{N-1}{N}\cdots\frac{N-k+1}{N} \]

If we set \(\frac{\lambda_i^k}{k!}\) aside, then the product of the remaining terms can be written as a product of \(k\) terms (because \(N^k\) is in the denominator), each of which is of the form \(\frac{N-j}{N}\) for \(j=0,1,2,\dots,k-1\) (\(k-1\) because the first term is \(\frac{N}{N}\)), namely: \[ \frac{N}{N}\cdot\frac{N-1}{N}\cdots\frac{N-k+1}{N} = \prod_{j=0}^{k-1}\left(1-\frac{j}{N}\right) \]

Bringing back the first term gives: \[ \frac{\lambda_i^k}{k!}\cdot \prod_{j=0}^{k-1}\left(1-\frac{j}{N}\right) \]

In the rare-event / large-depth regime, as \(N\to\infty\), each term goes to 1, i.e.,

\[ \left(1-\overset{\longrightarrow \ 0}{\cancel{\frac{j}{N}}}\right)\longrightarrow 1 \quad (N\to\infty), \]

That is because \(k\) is fixed (it does not grow with \(N\)). Thus, \(\frac{j}{N}\to 0\) for each \(j=0,1,2,\dots,k-1\). Hence, the product of the remaining terms tends to \(1\):

\[ \prod_{j=0}^{k-1}\left(1-\overset{\longrightarrow \ 0}{\cancel{\frac{j}{N}}}\right)\longrightarrow 1. \]

Therefore, in the rare-event / large-depth regime, the first two terms of the PMF of the binomial distribution converge to:

\[ \boxed{\binom{N}{k}p_i^k = \binom{N}{k}\left(\frac{\lambda_i}{N}\right)^k = \frac{N(N-1)\cdots(N-k+1)}{k!}\frac{\lambda_i^k}{N^k} = \frac{\lambda_i^k}{k!}.\overset{\longrightarrow\,1}{\cancel{\prod_{j=0}^{k-1}\left(1-\overset{\longrightarrow \ 0}{\cancel{\frac{j}{N}}}\right)}} = \frac{\lambda_i^k}{k!}.1 = \frac{\lambda_i^k}{k!}} \]

Now focus on the third term:

\[ \left(1-\frac{\lambda_i}{N}\right)^{N-k} \]

This term can be written as:

\[ \left(1-\frac{\lambda_i}{N}\right)^N \left(1-\frac{\lambda_i}{N}\right)^{-k} \]

Recall that (this is a well-known result): \[ \lim_{N\to\infty} \left(1-\frac{a}{N}\right)^N = e^{-a} \]

Therefore as \(N\to\infty\), the first term converges to \(e^{-\lambda_i}\): \[ \lim_{N\to\infty} \left(1-\frac{\lambda_i}{N}\right)^N = e^{-\lambda_i} \]

The second term converges to \(1\) as \(N\to\infty\):

\[ \lim_{N\to\infty} \left(1-\overset{\longrightarrow\,0}{\cancel{\frac{\lambda_i}{N}}}\right)^{-k} = 1 \]

That is because \(k\) is fixed (it does not grow with \(N\)). Thus, \(\frac{\lambda_i}{N}\to 0\) as \(N\to\infty\).

Putting it all together, the third term converges to:

\[ \boxed{\left(1-\frac{\lambda_i}{N}\right)^{N-k} = \left(1-\frac{\lambda_i}{N}\right)^N \left(1-\frac{\lambda_i}{N}\right)^{-k} = e^{-\lambda_i}\cdot 1 = e^{-\lambda_i}} \]

Combining this with the earlier result \(\frac{\lambda_i^k}{k!}\) gives:

\[ \boxed{\Pr(X_i=k) = e^{-\lambda_i}\frac{\lambda_i^k}{k!}} \]

This is exactly the PMF of a Poisson distribution: \[ \boxed{\Pr(X_i=k)=e^{-\lambda_i}\frac{\lambda_i^k}{k!}} \]

Therefore, the PMF of the binomial distribution converges to the PMF of a Poisson distribution:

\[ \Pr(X_i=k)\longrightarrow e^{-\lambda_i}\frac{\lambda_i^k}{k!} = \Pr(X_i=k)\text{ (PMF of Poisson distribution)}. \]

Why independent Poisson?

Jointly, the multinomial constraint \(\sum_i X_i=N\) induces weak negative dependence between barcodes.

Intuitively, the total number of reads is a fixed budget: if one barcode receives more reads than expected, fewer reads are available for the others. Formally, the covariance between two barcodes is negative in a multinomial model,

\[ \mathrm{Cov}(X_i,X_j)=-Np_ip_j \quad (i\neq j), \]

The negative value indicates the intuition mentioned above: an increase in one variable often requires a decrease in the other, as the sum of all variables is fixed at \(N\). So, the covariance is negative by construction. Equivalently, in terms of proportions \(\hat p_i=X_i/N\),

\[ \mathrm{Cov}(\hat p_i,\hat p_j)=-\frac{p_ip_j}{N}, \]

which shrinks to \(0\) as \(N\) grows. In DEL, each \(p_i\) is also very small, so these cross-barcode covariances are tiny. That is why the dependence is usually negligible in practice, and the joint distribution is well approximated by a product of independent Poisson variables:

\[ X_i \approx \text{Poisson}(\lambda_i), \qquad \lambda_i = Np_i,\quad i=1,\dots,K. \quad \text{(where K is the number of barcodes)} \]

This is why DEL models often treat per-barcode counts as independent Poisson (or overdispersed Negative Binomial): it is an accurate and computationally convenient approximation in the sparse-count regime.

Binomial / Beta-Binomial

These likelihoods are the one-barcode marginals of the fixed-depth (composition-based) data-generation view: they model how one barcode’s share behaves when the total read depth is fixed.

When modelling a single barcode’s counts conditional on fixed total read depth \(N\), we can treat the number of reads assigned to barcode \(i\) as the marginal of a multinomial model:

\[ \mathbf{X} \sim \text{Multinomial}(N,\mathbf{p}) \quad \Rightarrow \quad X_i \sim \text{Binomial}(N,p_i) \]

with \(p_i\) the true proportion of barcode \(i\) in the pool. Binomial assumes a fixed probability per read and variance \(N p_i(1-p_i)\), which can be too narrow when enrichment probabilities vary across experiments, PCR amplification, or selection noise.

A more overdispersed alternative is the Beta-Binomial.

\[ X_i \sim \text{BetaBinomial}(N, a_i, b_i) \]

References:

Quantitative Comparison of Enrichment from DNA-Encoded Chemical Library Selections - Introduces a binomial distribution model to compute normalised enrichment z-scores from DEL selections.

When is Beta-Binomial appropriate?

The Beta prior adds extra variability in enrichment probability, analogous to how the Negative Binomial adds extra variability to Poisson rates.

The Beta-Binomial is the one-dimensional marginal of the Dirichlet-Multinomial: if one barcode is treated as “success” and all others as “failure,” the Dirichlet prior on proportions reduces to a Beta prior for that barcode.

Beta-Binomial is particularly natural when modelling:

One barcode versus the background library
Binder-versus-non-binder formulations
Selections with fixed total sequencing depth

In practice:

In large DEL libraries, Beta-Binomial can be easier to use than Dirichlet-Multinomial because it avoids modelling all barcodes jointly.
When counts are small relative to \(N\), then \(\text{BetaBinomial} \approx \text{Negative Binomial}\).

Intuition:

Dirichlet-Multinomial models the entire library composition
Beta-Binomial models one barcode’s share of reads

Why does the Beta-Binomial approach the Negative Binomial when counts are small relative to sequencing depth \(N\)?

Start with the Beta-Binomial model:

\[ X_i \sim \text{Binomial}(N, p_i) \quad \text{where} \quad p_i \sim \text{Beta}(a, b) \]

In the rare-event/ large-depth regime, as \(N\to\infty\), the Binomial distribution converges to a Poisson distribution:

Check Why does the Binomial distribution converge to Poisson in this regime? box in the Multinomial / Dirichlet-Multinomial section for more details.

To complete the argument, we now need the analogous result for the prior on the success probability: under the same rare-event scaling, the Beta prior on \(p\) can be written as a Gamma prior on \(\lambda = Np\). Once that happens, the Beta-Binomial becomes a Poisson-Gamma mixture, which is also known as the Negative Binomial:

Why use the Poisson-scale intensity instead of the success probability \(p\)?

Under the rare-event scaling, the success probability \(p\) becomes very small while the Poisson-scale intensity \(\lambda = Np\) stays finite:

The difficulty with working directly with \(p\) is that its numerical scale depends on the sequencing depth \(N\). As \(N\) grows, the relevant values of \(p\) are pushed closer and closer to zero. So the same underlying barcode abundance gets represented by different tiny probabilities depending on how deeply we sequence. That makes \(p\) harder to interpret and harder to analyze in the large-\(N\) limit.

By contrast, \(\lambda = Np\) lives on the count scale. Under the Poisson approximation, it has a direct interpretation as the expected number of reads for a barcode:

\[ \lambda = \mathbb{E}[X] \]

So \(\lambda\) is easier to work with for three reasons:

Interpretability: \(\lambda\) is on the same scale as the observed counts, while very small probabilities are less intuitive.
Stable limiting behaviour: in the rare-event regime, \(p \to 0\) but \(\lambda\) can remain finite and non-degenerate.
Cleaner modelling: in the rare-event regime, \(\lambda\) is the natural parameter for count models, so both the likelihood and the limiting prior take a simpler and more interpretable form on the \(\lambda\) scale than on the \(p\) scale.

Concrete example: consider two sequencing experiments in which the same barcode is observed with count \(x = 2\), but the total sequencing depth differs.

If \(N = 10^3\), the corresponding empirical fraction is \(x/N = 2/1000 = 0.002\).
If \(N = 10^6\), the same count gives \(x/N = 2/1{,}000{,}000 = 0.000002\).

The observed count is identical in the two experiments, but the probability-scale quantity changes by a factor of 1000 simply because the total depth changed. By contrast, the Poisson-scale parameter \(\lambda = Np\) remains on the count scale, where it is naturally interpreted as an expected count:

If \(N = 10^3\), \(\lambda = Np = 10^3 \times 0.002 = 2\).
If \(N = 10^6\), \(\lambda = Np = 10^6 \times 0.000002 = 2\).

That makes \(\lambda\) much easier to compare, model, and reason about than raw tiny values of \(p\).

\(p\) is natural when thinking in terms of read-assignment probabilities, but \(\lambda = Np\) is usually the more useful quantity in sparse, high-depth DEL data because it stays meaningful on the scale of the observed counts.

First, we reparameterise the Beta prior on \(p\) to account for the rare-event (large \(N\), small \(p\)) regime. Next, we use the change-of-variables formula to transform the resulting density for \(p^{(N)}\) into a density for the Poisson-scale variable \(\lambda^{(N)} = Np^{(N)}\).

1. Rare-event reparameterisation of the Beta prior on \(p\)

To keep expected counts in the sparse regime as sequencing depth grows, we let the Beta prior depend on N.

The Beta prior on \(p\) is given by:

\[ p \sim \text{Beta}(a, b) \]

To study the rare-event regime, we let the prior itself depend on the sequencing depth \(N\). The idea is that as \(N\) grows, the per-barcode probability \(p\) should become correspondingly smaller so that expected counts stay in the sparse-count regime. A convenient way to encode this is to keep the first Beta parameter fixed and scale the second one linearly with depth: \[ b^{(N)} = \eta N \]

\[ p^{(N)} \sim \text{Beta}(a, b^{(N)}) \]

The superscript \((N)\) does not index a barcode. It indicates that we are considering a whole sequence of random variables, one for each sequencing depth \(N\).

So when \(N\) changes, the prior changes as well. In particular, the second Beta parameter grows with \(N\), which pushes typical values of \(p^{(N)}\) toward zero. This is the scaling we want in a rare-event setting, where each individual barcode should occupy only a tiny fraction of the total reads even as the total read depth becomes large.

Code

import numpy as np
import pandas as pd
import plotnine as gg
from scipy.stats import beta as beta_dist

beta_shape_df = pd.DataFrame({"p": np.linspace(1e-4, 0.3, 800)})

beta_shape_df = pd.concat(
    [
        beta_shape_df.assign(b="b = 2", density=beta_dist.pdf(beta_shape_df["p"], a=2, b=2)),
        beta_shape_df.assign(b="b = 10", density=beta_dist.pdf(beta_shape_df["p"], a=2, b=10)),
        beta_shape_df.assign(b="b = 50", density=beta_dist.pdf(beta_shape_df["p"], a=2, b=50)),
        beta_shape_df.assign(b="b = 200", density=beta_dist.pdf(beta_shape_df["p"], a=2, b=200)),
    ],
    ignore_index=True,
)
beta_shape_df["b"] = pd.Categorical(
    beta_shape_df["b"],
    categories=["b = 2", "b = 10", "b = 50", "b = 200"],
    ordered=True,
)

beta_shape_plot = (
    gg.ggplot(beta_shape_df, gg.aes(x="p", y="density", color="b"))
    + gg.geom_line(size=1.1)
    + gg.coord_cartesian(xlim=(0, 0.2))
    + gg.scale_color_manual(
        values=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"],
        breaks=["b = 2", "b = 10", "b = 50", "b = 200"],
    )
    + gg.labs(
        x="Success probability p",
        y="Beta density",
        color="Beta parameter",
        title="Holding a fixed, larger b shifts Beta(a, b) toward smaller p",
    )
    + gg.theme_bw()
)

beta_shape_plot.save("img/kindel_beta_shape_plot.png", verbose=False)

Effect of increasing b in a Beta(a, b) prior while keeping a = 2 fixed.

Figure: Effect of increasing \(b\) in a Beta\((a,b)\) prior while keeping \(a = 2\) fixed.

Why should \(p^{(N)}\) and \(\text{Beta}(a,b^{(N)})\) depend on sequencing depth?

Recall the concrete example in the previous Why work with Poisson-scale intensity instead of success probability? box.

Consider two sequencing experiments in which the same barcode is observed with count \(x = 2\), but the total sequencing depth differs.

If \(N = 10^3\), the corresponding empirical fraction is \(x/N = 2/1000 = 0.002\).

If \(N = 10^6\), the same count gives \(x/N = 2/1{,}000{,}000 = 0.000002\).

If \(N = 10^3\), \(\lambda = Np = 10^3 \times 0.002 = 2\).

If \(N = 10^6\), \(\lambda = Np = 10^6 \times 0.000002 = 2\).

The point of writing \(p^{(N)}\) is to emphasize that the relevant success probability changes with sequencing depth. The product \(Np^{(N)}\) stays on the interpretable Poisson scale, while \(p^{(N)}\) itself becomes smaller as \(N\) grows.

The same logic explains why the Beta prior should also depend on \(N\). In the rare-event regime, we do not want \(p\) to stay fixed as \(N\) grows, because then the Poisson-scale quantity \(\lambda = Np\) would grow without bound. Instead, we want \(p^{(N)}\) to shrink at roughly the rate \(1/N\), so that \(\lambda^{(N)} = Np^{(N)}\) stays of order 1.

To see why, start from the Beta mean:

\[ \mathbb{E}[p^{(N)}] = \frac{a}{a + b^{(N)}}. \]

If we scale the second parameter with sequencing depth by setting

\[ b^{(N)} = \eta N, \]

then

\[ \mathbb{E}[p^{(N)}] := \frac{a}{a + \eta N}. \]

Now rewrite the denominator:

\[ a + \eta N = \eta N \left(1 + \frac{a}{\eta N}\right). \]

Substituting this back gives

\[ \mathbb{E}[p^{(N)}] := \frac{a}{a + \eta N} := \frac{a}{\eta N \left(1 + \frac{a}{\eta N}\right)} := \frac{a}{\eta N}\cdot \frac{1}{1 + \underbrace{\frac{a}{\eta N}}_{\to 0}} \approx \frac{a}{\eta N} \quad \text{as} \quad N \to \infty. \]

So the prior mean of \(p^{(N)}\) is automatically of order \(1/N\). This is exactly the behaviour we want: as sequencing depth grows, the success probability should shrink so that the Poisson-scale quantity

\[ \lambda^{(N)} = Np^{(N)} \]

stays on a stable count scale rather than blowing up.

Equivalently, if we want the Poisson-scale intensity to stay around \(\lambda \approx 2\), as in the concrete example above, we want

\[ N\,\mathbb{E}[p^{(N)}] \approx 2. \]

One simple choice is \(a = 2\) and \(\eta = 1\), because then

\[ N\,\mathbb{E}[p^{(N)}] := N \cdot \frac{2}{2 + (1 \cdot N)} := \frac{2N}{2 + N} \approx 2 \quad \text{as} \quad N \to \infty. \]

So the scaling is not arbitrary. It makes the prior on \(p^{(N)}\) shift toward smaller probabilities as sequencing depth increases, while the induced prior on \(\lambda^{(N)} = Np^{(N)}\) stays on a stable, interpretable count scale.

This is why the scaling \(b^{(N)} = \eta N\) matters: it prevents the implied count scale from blowing up with \(N\) and sets up the next derivation, where the prior on \(p^{(N)}\) becomes a Gamma prior on \(\lambda^{(N)}\).

The plot below is meant only to visualise this scaling. For the concrete example above with \(a = 2\) and \(\eta = 1\), it compares the priors induced by two sequencing depths, \(N = 10^3\) and \(N = 10^6\). As depth increases, the Beta density shifts sharply toward smaller values of \(p\), so that the corresponding Poisson-scale quantity \(\lambda^{(N)} = Np^{(N)}\) remains on a stable count scale rather than diverging with \(N\).

Code

import numpy as np
import pandas as pd
import plotnine as gg
from scipy.stats import beta as beta_dist
from mizani.formatters import scientific_format

beta_example = pd.DataFrame({
    "p": np.logspace(-8, -1, 800),
})

beta_example = pd.concat(
    [
        beta_example.assign(
            N="N = 10^3",
            density=beta_dist.pdf(beta_example["p"], a=2, b=10**3),
        ),
        beta_example.assign(
            N="N = 10^6",
            density=beta_dist.pdf(beta_example["p"], a=2, b=10**6),
        ),
    ],
    ignore_index=True,
)
beta_example = beta_example.loc[beta_example["density"] > 0].copy()

beta_depth_plot = (
    gg.ggplot(beta_example, gg.aes(x="p", y="density", color="N"))
    + gg.geom_line(size=1.2)
    + gg.scale_x_log10(labels=scientific_format(digits=1))
    + gg.scale_y_log10(labels=scientific_format(digits=1))
    + gg.scale_color_manual(values=["#1f77b4", "#d62728"])
    + gg.labs(
        x="Success probability p",
        y="Beta density (log scale)",
        color="Sequencing depth",
        title="Increasing sequencing depth shifts Beta(a, \\eta N) toward smaller p",
    )
    + gg.theme_bw()
    + gg.theme(
        figure_size=(7, 4.8),
        panel_grid_minor=gg.element_blank(),
        panel_grid_major=gg.element_line(color="#e5e5e5", size=0.3),
        plot_title=gg.element_text(weight="bold"),
        legend_position=(0.72, 0.83),
    )
)

beta_depth_plot.save("img/kindel_beta_depth_plot.png", verbose=False)

Figure: Comparison of the example priors \(\text{Beta}(2, 1 \cdot 10^3)\) and \(\text{Beta}(2, 1 \cdot 10^6)\) on the success-probability scale.

The point of the plot is not merely that \(p\) gets smaller, but that it gets smaller at exactly the rate needed for \(\lambda^{(N)} = Np^{(N)}\) to stay finite. The log-scaled y-axis is used only to make both densities visible in a single overlay.

2. Change-of-variables formula for densities

For accessible refreshers on the change-of-variables formula, see this video from The Bright Side of Mathematics and read this section of Lilian Weng’s blog on Flow-based Deep Generative Models.

If a random variable \(X\) has density \(f_X(x)\) and we define a new variable

\[ Y = g(X) \]

where \(g\) is one-to-one and differentiable, then the density of \(Y\) is

\[ f_Y(y) = f_X\bigl(g^{-1}(y)\bigr)\left|\frac{d}{dy}g^{-1}(y)\right| \]

The key idea is that a change of variables does not create or destroy probability, it only redistributes it along a new axis. If \(g\) stretches the axis, the same probability is spread over a wider interval, so the density goes down; if \(g\) compresses the axis, that same probability is packed into a narrower interval, so the density goes up. The derivative term tells us exactly how much stretching or compression is happening.

3. Combine the change-of-variables formula with the rare-event reparameterisation

The goal is to show that the transformed prior converges to a Gamma distribution on the Poisson scale.

The density of \(X = p^{(N)}\) is:

\[ f_{p^{(N)}}(p) := \frac{1}{B(a,b^{(N)})} p^{a-1}(1-p)^{b^{(N)}-1}, \qquad 0 < p < 1. \]

Now define the transformed random variable: \[ \lambda^{(N)} = Np^{(N)}. \]

At the level of values, the inverse transformation is: \[ p = \frac{\lambda}{N}. \]

And, in the notation of the general change-of-variables formula \(Y = g(X)\), we have:

\(X = p^{(N)}\),
\(Y = \lambda^{(N)}\),
\(g(X) = NX = Np^{(N)}\),
at the level of values, \(x = p\), \(y = \lambda\), and \(g(x) = Nx = Np\).

So the inverse transformation is:

\[ g^{-1}(\lambda) = \frac{\lambda}{N}, \]

and its derivative is:

\[ \frac{d}{d\lambda}g^{-1}(\lambda) = \frac{1}{N}. \]

Therefore, the change-of-variables formula gives:

\[ f_{\lambda^{(N)}}(\lambda) = f_{p^{(N)}}\!\left(\frac{\lambda}{N}\right)\left|\frac{1}{N}\right| \]

\[ = \frac{1}{B(a,b^{(N)})} \left(\frac{\lambda}{N}\right)^{a-1} \left(1-\frac{\lambda}{N}\right)^{b^{(N)}-1} \cdot \frac{1}{N}. \]

Now substitute the rare-event scaling \(b^{(N)} = \eta N\):

\[ f_{\lambda^{(N)}}(\lambda) = \frac{1}{B(a,\eta N)} \left(\frac{\lambda}{N}\right)^{a-1} \left(1-\frac{\lambda}{N}\right)^{\eta N-1} \cdot \frac{1}{N}. \]

Collect the powers of \(N\):

\[ f_{\lambda^{(N)}}(\lambda) = \frac{1}{B(a,\eta N)} \lambda^{a-1}N^{-a} \left(1-\frac{\lambda}{N}\right)^{\eta N-1}, \qquad 0 < \lambda < N. \]

Next, rewrite the Beta function in terms of Gamma functions:

\[ B(a,\eta N) = \frac{\Gamma(a)\Gamma(\eta N)}{\Gamma(a+\eta N)}. \]

Therefore,

\[ \frac{1}{B(a,\eta N)} = \frac{\Gamma(a+\eta N)}{\Gamma(a)\Gamma(\eta N)}. \]

Substituting this into the density gives

\[ f_{\lambda^{(N)}}(\lambda) = \frac{\Gamma(a+\eta N)}{\Gamma(a)\Gamma(\eta N)} \lambda^{a-1}N^{-a} \left(1-\frac{\lambda}{N}\right)^{\eta N-1}. \]

Now take the limit as \(N \to \infty\).

For the Gamma-ratio term, use the standard asymptotic fact:

\[ \frac{\Gamma(z+a)}{\Gamma(z)} \sim z^a \qquad \text{as } z \to \infty. \]

Setting \(z = \eta N\), we get:

\[ \frac{\Gamma(a+\eta N)}{\Gamma(\eta N)}N^{-a} \sim (\eta N)^a N^{-a} = \eta^a. \]

So the prefactor converges to:

\[ \frac{\eta^a}{\Gamma(a)}. \]

For the remaining factor:

\[ \left(1-\frac{\lambda}{N}\right)^{\eta N-1} = \left(1-\frac{\lambda}{N}\right)^{\eta N} \left(1-\frac{\lambda}{N}\right)^{-1} \to e^{-\eta\lambda}\cdot 1 = e^{-\eta\lambda}. \]

Therefore:

\[ f_{\lambda^{(N)}}(\lambda) \longrightarrow \frac{\eta^a}{\Gamma(a)}\lambda^{a-1}e^{-\eta\lambda}, \qquad \lambda > 0. \]

This is exactly the Gamma density with shape \(a\) and rate \(\eta\). Hence:

\[ \lambda^{(N)} = Np^{(N)} \overset{d}{\longrightarrow} \text{Gamma}(a,\eta). \]

Now combine this with the likelihood. Under the same rare-event scaling:

\[ X \mid p^{(N)} \sim \text{Binomial}(N,p^{(N)}) \quad \Longrightarrow \quad X \mid \lambda^{(N)} \approx \text{Poisson}(\lambda^{(N)}). \]

So in the limit we have:

\[ \lambda \sim \text{Gamma}(a,\eta), \qquad X \mid \lambda \sim \text{Poisson}(\lambda). \]

The marginal distribution of \(X\) is obtained by integrating out \(\lambda\):

\[ \Pr(X=x) = \int_0^\infty \left(\frac{e^{-\lambda}\lambda^x}{x!}\right) \left(\frac{\eta^a}{\Gamma(a)}\lambda^{a-1}e^{-\eta\lambda}\right) d\lambda. \]

Pull out the constants:

\[ \Pr(X=x) = \frac{\eta^a}{\Gamma(a)x!} \int_0^\infty \lambda^{x+a-1}e^{-(\eta+1)\lambda}\, d\lambda. \]

Now use the Gamma-integral identity:

\[ \int_0^\infty \lambda^{k-1}e^{-c\lambda}\, d\lambda = \frac{\Gamma(k)}{c^k}, \qquad k>0,\; c>0. \]

With \(k = x+a\) and \(c = \eta+1\), we obtain:

\[ \Pr(X=x) = \frac{\eta^a}{\Gamma(a)x!} \cdot \frac{\Gamma(x+a)}{(\eta+1)^{x+a}} \]

\[ = \frac{\Gamma(x+a)}{\Gamma(a)x!} \left(\frac{\eta}{\eta+1}\right)^a \left(\frac{1}{\eta+1}\right)^x. \]

This is the Negative Binomial probability mass function under a standard parameterisation. Therefore:

\[ \text{BetaBinomial}(N,a,\eta N) \;\Longrightarrow\; \text{Negative Binomial} \qquad \text{as } N \to \infty. \]

So the full logic is:

Binomial becomes Poisson in the rare-event / large-depth regime.
Beta on \(p^{(N)}\) becomes Gamma on \(\lambda^{(N)} = Np^{(N)}\).
Poisson mixed with Gamma gives Negative Binomial.

Example: KinDEL dataset

For a concrete example, we use data from the KinDEL paper. You can learn more about the resource here and access the data here.

KinDEL is one of the first large public DEL datasets designed to support machine-learning research in drug discovery. The resource contains approximately 81 million small molecules screened against two kinase targets (MAPK14 and DDR1), with accompanying benchmark tasks aimed at predictive hit identification and DEL denoising. Beyond raw count data, KinDEL also includes replicate experiments and biophysical validation data (on-DNA and off-DNA assays), making it useful for studying how well models recover true binding signals from noisy DEL measurements.

We load only a subset of the data to keep the example simple and fast to run. In particular, we use the data for the DDR1 target from the first split.

Code

data = pd.read_parquet("data/df_train_ddr1_split1_random.parquet")
data.head()

	smiles	molecule_hash	smiles_a	smiles_b	smiles_c	seq_target_1	seq_target_2	seq_target_3	seq_matrix_1	seq_matrix_2	seq_matrix_3	seq_load	y
0	CNC(=O)[C@H](CNC(=O)c1ccccc1CNC(=O)c1ccc(-c2no...	33531ebf0be8ba2787e85a03221df1b41a0512eaf09b1a...	CNC(=O)[C@H](CN)NC(=O)c1ccc2[nH]ccc2c1	CC1=NC(=NO1)C1=CC=C(C=C1)C(O)=O	OC(=O)c1ccccc1CNC(=O)OCC1c2ccccc2-c2ccccc12	1.0	6.0	2.0	0.0	4.0	1.0	19	0.352696
1	CNC(=O)[C@H](CCc1ccccc1)NC(=O)CN1C(=O)[C@@H](N...	519cddc9ad4ab8538ced2bcfbb17e034240c68b9cd52c4...	CNC(=O)[C@@H](N)CCc1ccccc1	OC(=O)CC1=CC(Cl)=C(O)C=C1	OC(=O)CN1c2ccccc2OC[C@H](NC(=O)OCC2c3ccccc3-c3...	5.0	9.0	7.0	1.0	1.0	1.0	12	1.482711
2	CNC(=O)c1sc(C2CCN(C(=O)[C@H](CC3CCCCC3)NCc3csc...	d385e4289ecf6fcce902738ee8ea9ab32c43c31107cc9c...	CNC(=O)c1sc(C2CCNCC2)nc1C	BrC1=NC(C=O)=CS1	OC(=O)[C@H](CC1CCCCC1)NC(=O)OCC1c2ccccc2-c2ccc...	1.0	2.0	2.0	0.0	0.0	0.0	7	0.440103
3	CNC(=O)c1cccc(CNC(=O)[C@H](CC2CCCCC2)NC(=O)c2c...	96075c4092ecbba0786c42c576606e70b25d72bcccdc82...	CNC(=O)c1cccc(CN)c1	CN(C)C1=CC=CC(=C1)C(O)=O	OC(=O)[C@H](CC1CCCCC1)NC(=O)OCC1c2ccccc2-c2ccc...	2.0	2.0	4.0	1.0	3.0	0.0	19	0.337236
4	CNC(=O)[C@@H](CNC(=O)C1CN(C(=O)CCOC)CCO1)NC(=O...	dbb42a3997dd44957450af19bb6a4f4f1ffbdef0beafe4...	CNC(=O)[C@@H](CN)NC(=O)c1cccnc1	COCCC(O)=O	OC(=O)C1CN(CCO1)C(=O)OCC1c2ccccc2-c2ccccc12	4.0	1.0	2.0	1.0	2.0	0.0	26	0.321008

The columns are:

smiles: Canonical SMILES string for the fully assembled DEL molecule (the tri‑synthon product).
molecule_hash: Stable identifier/hash for the assembled molecule (useful for joins and de‑duplication).
smiles_a: SMILES for the synthon/fragment at attachment site A (cycle/position A).
smiles_b: SMILES for the synthon/fragment at attachment site B (cycle/position B).
smiles_c: SMILES for the synthon/fragment at attachment site C (cycle/position C).
seq_target_1: Sequencing count for target selection replicate 1 (number of unique DNA sequences observed for this molecule after selection against the protein target).
seq_target_2: Sequencing count for target selection replicate 2.
seq_target_3: Sequencing count for target selection replicate 3.
seq_matrix_1: Sequencing count for the negative control / matrix replicate 1 (selection without protein target).
seq_matrix_2: Sequencing count for the negative control / matrix replicate 2.
seq_matrix_3: Sequencing count for the negative control / matrix replicate 3.
seq_load: Pre‑selection sequencing count (rough estimate of each molecule’s abundance in the library prior to selection).
y: Training label used in KinDEL’s benchmarks; in the released 1M training subsets this column is the dataset’s target_enrichment value (a Poisson‑enrichment score comparing target vs control/matrix across replicates).

Basic Exploratory Data Analysis (EDA)

We begin with a simple view of sequencing counts by condition.

Code

data_seq_counts = data.melt(id_vars=['molecule_hash'], value_vars=['seq_target_1', 'seq_target_2', 'seq_target_3', 'seq_matrix_1', 'seq_matrix_2', 'seq_matrix_3', 'seq_load'], value_name='seq_count', var_name='condition')

p = (
    gg.ggplot(data_seq_counts, gg.aes(x = 'condition', y='seq_count'))
    + gg.geom_jitter(width=0.05, height=0, size=0.1)
    + gg.labs(x='Condition', y='Sequencing Count', title='Distribution of Sequencing Counts by Condition')
)
p.save("img/kindel_eda_seq_counts.png")

Figure 3: Distribution of counts by condition for the KinDEL DDR1 `split1_random` training subset.

Matrix binders

A key observation from the plot is that some compounds have very high sequencing counts in the matrix (control) conditions. The distribution of matrix counts appears bimodal. This suggests that some molecules may bind strongly to the matrix itself, rather than only the intended protein target. Such molecules could give high signals in both the target and matrix conditions, making it challenging to tell if they’re true protein binders or simply show nonspecific, matrix-binding effects.

Strong low-count mass across all conditions

A high-density region near low counts suggests that most molecules have very low sequencing counts. This is typical of DEL data, where most barcodes are rare or absent in a given selection. This indicates a sparse count structure consistent with large library size and small per-barcode probabilities.

Figure 5: Most molecules have very low counts, indicating sparse DEL data.

A better way to visualise this is to plot the cumulative distribution function (CDF) of the sequencing counts.

What is an ECDF plot, and how should we interpret it?

An ECDF (Empirical Cumulative Distribution Function) shows the cumulative proportion of observations that are less than or equal to a given value. Mathematically, it is defined as:

\[ F(x) = \Pr(X \leq x) \]

An ECDF transforms raw data into a cumulative probability view, making distributional structure and percentiles immediately interpretable without relying on arbitrary bins or model assumptions. If the curve is at 0.75 when \(x=10\), then 75% of the observations are less than or equal to 10.

The axes are:

x-axis: the variable of interest (e.g. sequencing count)
y-axis: the cumulative probability

The shape tells us:

Steep section: high density of observations near the value
Flat section: low density of observations near the value
Slope: rate of change of the cumulative probability

Code

# Plot ECDF faceted by condition, and add vertical line at the 85% quantile for each
#| output: false
seq_conditions = [
    'seq_target_1', 'seq_target_2', 'seq_target_3',
    'seq_matrix_1', 'seq_matrix_2', 'seq_matrix_3'
]

# Melt data for plotnine
data_ecdf = data.melt(
    id_vars=['molecule_hash'],
    value_vars=seq_conditions,
    var_name='condition',
    value_name='seq_count'
)

# Compute the 85th percentile (quantile) for each condition
ecdf_thresholds = (
    data_ecdf.groupby('condition')['seq_count']
    .quantile(0.85)
    .reset_index()
    .rename(columns={'seq_count': 'quantile_85'})
)

# Merge quantiles back for vertical line drawing
# (not strictly necessary, just for convenience)
data_ecdf = data_ecdf.merge(ecdf_thresholds, on='condition', how='left')

# Prepare dataframe for vertical lines
vline_data = ecdf_thresholds.copy()
vline_data['y_start'] = 0
vline_data['y_end'] = 0.85


ecdf_plot = (
    gg.ggplot(data_ecdf, gg.aes(x='seq_count'))
    + gg.stat_ecdf(size=1)
    + gg.geom_vline(
        data=vline_data,
        mapping=gg.aes(xintercept='quantile_85'),
        linetype="dashed", color="red"
    )
    + gg.geom_label(
        data=vline_data.assign(y_pos=0.5),
        mapping=gg.aes(
            x='quantile_85',
            y='y_pos',
            label='round(quantile_85, 1)'
        ),
        color="red",
        fill="white",
        size=11,
        ha='left',
        va='center',
        nudge_x=1,
        label_padding=0.4,
        format_string="{:.0f}"
    )
    + gg.facet_wrap('~condition', scales='free')
    + gg.labs(
        x='Sequencing Count',
        y='Cumulative Probability',
        title='ECDF of Sequencing Counts by Condition (with 85% Quantile)'
    )
    + gg.theme_bw()
    + gg.theme(strip_text=gg.element_text(size=11))
)

ecdf_plot.save("img/kindel_ecdf_by_condition.png")

Figure 6: Cumulative distribution function (CDF) of the sequencing counts for all conditions with the 85th percentile marked.

The ECDF indicates that most molecules have very low sequencing counts, and the patterns differ between the matrix and target selections. This difference reflects the distinct biological and technical contexts of each selection method.

For the matrix selection, the distribution mirrors the composition of the input library, suggesting that only a small subset of molecules bind to the matrix. This is a desirable outcome because it indicates limited non-specific binding.

In contrast, the target selection distribution shows higher counts at the 85th percentile, demonstrating that the target protein enriches for specific compounds. This suggests successful binding or functional selection of molecules by the target protein.

Discrete horizontal bands

The appearance of discrete horizontal bands in the plot reflects repeated, identical sequencing count values across many molecules. This happens because next-generation sequencing produces integer counts, and low-abundance molecules are often observed at the same small values. As a result, low integer counts are repeated across many molecules, producing visible steps in the ECDF (Figure 6) and horizontal banding in scatter-style plots (Figure 7). This effect is especially noticeable when sequencing depth is modest relative to library complexity or when many molecules have similarly low abundance.

We next plot the frequency of each sequencing count value (\(\leq 20\)) for all conditions. This directly shows why the bands appear: thousands of molecules share the same low integer count.

Code

# Using the already-melted data_ecdf from earlier
count_of_counts = (
    data_ecdf
    .groupby(['condition', 'seq_count'])
    .size()
    .reset_index(name='n_molecules')
)

# Truncate at, say, count=20 for visibility
count_of_counts_trunc = count_of_counts[count_of_counts['seq_count'] <= 20]


frequency_of_counts_plot = (
    gg.ggplot(count_of_counts_trunc, gg.aes(x='factor(seq_count)', y='n_molecules'))
    + gg.geom_col(fill='steelblue', width=0.7)
    + gg.facet_wrap('~condition', scales='free_y')
    + gg.labs(
        x='Sequencing Count',
        y='Number of Molecules',
        title='Frequency of Each Sequencing Count Value (≤ 20)'
    )
    + gg.theme_bw()
    + gg.theme(axis_text_x=gg.element_text(size=8, rotation=90), figure_size=(15, 4))
)

frequency_of_counts_plot.save("img/kindel_frequency_of_counts.png")

Figure 8: Frequency of each sequencing count value (\(\leq 20\)) for all conditions.

Figure 8 confirms that low integer counts dominate across conditions: counts such as 0, 1, 2, and 3 occur for very large numbers of molecules, while higher counts are much rarer. This pile-up at a few small integer values is what produces the visible step-like ECDF behaviour and discrete horizontal bands seen in the earlier plots.

Correlations between replicates

Each experimental selection (both matrix and target) was performed in three independent replicates:

To assess reproducibility, we compare replicate-level sequencing proportions for both matrix and target selections. For each replicate, molecule counts are normalised by the replicate total and then compared pairwise. These comparisons are summarised in three complementary ways: scatter plots (Figure 10) show pattern shape and deviation from the \(y=x\) line, a Spearman correlation heatmap (Figure 11) provides a compact quantitative summary of pairwise agreement, and a replicate-support plot (Figure 12) shows how often molecules are observed at all across replicates.

Code

data_seq_counts_matrix = data[['seq_matrix_1', 'seq_matrix_2', 'seq_matrix_3']]

# compute the proportion of the sequencing counts for the matrix conditions
data_seq_counts_matrix['seq_matrix_1_prop'] = data_seq_counts_matrix['seq_matrix_1'] / data_seq_counts_matrix['seq_matrix_1'].sum()
data_seq_counts_matrix['seq_matrix_2_prop'] = data_seq_counts_matrix['seq_matrix_2'] / data_seq_counts_matrix['seq_matrix_2'].sum()
data_seq_counts_matrix['seq_matrix_3_prop'] = data_seq_counts_matrix['seq_matrix_3'] / data_seq_counts_matrix['seq_matrix_3'].sum()

data_seq_counts_target = data[['seq_target_1', 'seq_target_2', 'seq_target_3']]

# compute the proportion of the sequencing counts for the target conditions
data_seq_counts_target['seq_target_1_prop'] = data_seq_counts_target['seq_target_1'] / data_seq_counts_target['seq_target_1'].sum()
data_seq_counts_target['seq_target_2_prop'] = data_seq_counts_target['seq_target_2'] / data_seq_counts_target['seq_target_2'].sum()
data_seq_counts_target['seq_target_3_prop'] = data_seq_counts_target['seq_target_3'] / data_seq_counts_target['seq_target_3'].sum()


p_1 =(
    gg.ggplot(data_seq_counts_matrix, gg.aes(x = 'seq_matrix_1_prop', y='seq_matrix_2_prop'))
    + gg.geom_point(position='jitter', size=0.1, alpha=0.5)
    + gg.geom_abline(slope=1, intercept=0, color='red')
    + gg.labs(x='Sequencing Proportion for Matrix 1', y='Sequencing Proportion for Matrix 2') 

)

p_2 = (
    gg.ggplot(data_seq_counts_matrix, gg.aes(x = 'seq_matrix_1_prop', y='seq_matrix_3_prop'))
    + gg.geom_point(position='jitter', size=0.1, alpha=0.5)
    + gg.geom_abline(slope=1, intercept=0, color='red')
    + gg.labs(x='Sequencing Proportion for Matrix 1', y='Sequencing Proportion for Matrix 3')
)

p_3 = (
    gg.ggplot(data_seq_counts_matrix, gg.aes(x = 'seq_matrix_2_prop', y='seq_matrix_3_prop'))
    + gg.geom_point(position='jitter', size=0.1, alpha=0.5)
    + gg.geom_abline(slope=1, intercept=0, color='red')
    + gg.labs(x='Sequencing Proportion for Matrix 2', y='Sequencing Proportion for Matrix 3')
)

p_4 =(
    gg.ggplot(data_seq_counts_target, gg.aes(x = 'seq_target_1_prop', y='seq_target_2_prop'))
    + gg.geom_point(position='jitter', size=0.1, alpha=0.5)
    + gg.geom_abline(slope=1, intercept=0, color='red')
    + gg.labs(x='Sequencing Proportion for Target 1', y='Sequencing Proportion for Target 2') 

)

p_5 = (
    gg.ggplot(data_seq_counts_target, gg.aes(x = 'seq_target_1_prop', y='seq_target_3_prop'))
    + gg.geom_point(position='jitter', size=0.1, alpha=0.5)
    + gg.geom_abline(slope=1, intercept=0, color='red')
    + gg.labs(x='Sequencing Proportion for Target 1', y='Sequencing Proportion for Target 3')
)

p_6 = (
    gg.ggplot(data_seq_counts_target, gg.aes(x = 'seq_target_2_prop', y='seq_target_3_prop'))
    + gg.geom_point(position='jitter', size=0.1, alpha=0.5)
    + gg.geom_abline(slope=1, intercept=0, color='red')
    + gg.labs(x='Sequencing Proportion for Target 2', y='Sequencing Proportion for Target 3')
)



replicate_corr_plot = (p_1 | p_2 | p_3) / (p_4 | p_5 | p_6) + gg.theme(figure_size=(15, 8))
replicate_corr_plot.save("img/Replicates_Correlations.png", verbose=False)

Figure 10: Correlations between replicates for the matrix and target selection conditions.

In Figure 10, stronger clustering along the red \(y=x\) line indicates better agreement between replicate pairs. Agreement is generally clearer for matrix replicates and for higher-abundance molecules. The broad spread at very low proportions is expected in sparse DEL data, where sampling noise dominates; this reduces apparent concordance for low-abundance molecules without necessarily indicating experimental failure.

Why does sampling noise dominate at low proportions?

Sequencing is a sampling process. Each molecule \(i\) has some true (but unknown) proportion \(p_i\) in the library. Under the rare-event / large-depth regime, when we sequence a total of \(N\) reads, the observed count \(n_i\) for molecule \(i\) is approximately Poisson-distributed (see the Why under these conditions, the binomial distribution converges to a Poisson distribution? box in the Multinomial / Dirichlet-Multinomial section for more details):

\[n_i \sim \text{Poisson}(\lambda_i), \quad \lambda_i = N \cdot p_i\]

The Poisson distribution has the property that its variance equals its mean:

\[\mathbb{E}[n_i] = \lambda_i, \quad \text{Var}(n_i) = \lambda_i\]

The observed proportion is \(\hat{p}_i = n_i / N\). Its variance is:

\[\text{Var}(\hat{p}_i) = \frac{p_i}{N}\]

and the standard deviation is:

\[\text{SD}(\hat{p}_i) = \sqrt{\text{Var}(\hat{p}_i)} = \sqrt{\frac{p_i}{N}}\]

The key quantity is the coefficient of variation (CV), which measures relative noise, or how large the noise is compared to the signal: \(CV = \frac{SD}{Mean}\). It is a dimensionless quantity that is always positive. The higher the CV, the more noise there is compared to the signal.

\[ \text{CV}(\hat{p}_i) = \frac{\text{SD}(\hat{p}_i)}{\mathbb{E}[\hat{p}_i]} = \frac{\sqrt{p_i / N}}{p_i} = \frac{\sqrt{p_i} \;/\; \sqrt{N}}{p_i} = \frac{\sqrt{p_i}}{\sqrt{N} \cdot p_i} \]

Divide both numerator and denominator by \(\sqrt{p_i}\): \[ \frac{\sqrt{p_i}}{\sqrt{N} \cdot p_i} = \frac{\sqrt{p_i} \;/\; \sqrt{p_i}}{(\sqrt{N} \cdot p_i) \;/\; \sqrt{p_i}} = \frac{1}{\sqrt{N} \cdot \sqrt{p_i}} = \frac{1}{\sqrt{N \cdot p_i}} \]

This means that the CV is: \[ \boxed{CV(\hat{p}_i) = \frac{1}{\sqrt{\lambda_i}}} \]

This relationship has a straightforward interpretation:

When \(N \cdot p_i\) is large (high-abundance molecules), the CV is small. The observed proportion \(\hat{p}_i\) is a precise estimate of \(p_i\), so two replicates will agree closely.
When \(N \cdot p_i\) is small (low-abundance molecules), the CV blows up. For example, if the expected count is \(\lambda_i = 1\), then \(\text{CV} = 100\%\). The actual count could easily be 0, 1, 2, or 3, which is wildly different relative to the expectation.

Since each replicate draws counts independently, their sampling errors are uncorrelated. When both replicates have high relative noise (low \(p_i\)), the scatter in the replicate-vs-replicate plot compounds the noise from both, producing the characteristic fan-shaped spread at low proportions.

Why do matrix and target replicates show different correlation patterns?

Matrix selection

The matrix condition has no target protein present. Molecules are selected purely based on non-specific affinity for the matrix itself. Because this process is largely driven by physical chemistry (e.g., charge, hydrophobicity) rather than specific binding events, the enrichment is relatively stable and reproducible across replicates. As a result, a molecule that gets a high count in matrix replicate 1 tends to get similarly high counts in matrix replicates 2 and 3, and the points cluster more closely along the diagonal.

Target selection

The target selection involves a specific biological interaction between each molecule and the protein (DDR1 in this case). Several factors introduce extra variability here:

Stochastic binding events: True binders associate and dissociate with the protein in a probabilistic way, and this randomness gets amplified differently in each independent replicate. Here, stochastic binding events are the main source of variability.
Rare true hits: The vast majority of molecules do not bind the target protein at all, only a tiny fraction are true binders. For these rare binders, the number of molecules that make it through the entire process (binding, washing, elution, PCR, and sequencing) is very small to begin with. As a result, even minor random changes or noise in any step can cause large swings in their observed counts between replicates. Here, low abundance makes any small random variation in any step matter more in relative terms.
PCR amplification noise: This multiplicative noise affects any enriched compound. Target-enriched compounds tend to have higher absolute counts (higher \(n_{\text{true}}\)), so the variance scales up more dramatically.

We can make this more explicit with a simple multiplicative model of PCR amplification. Each molecule is copied many times, and the number of copies varies stochastically. A simple model is:

\[ n_{\text{obs}} = n_{\text{true}} \cdot \epsilon \]

Where:
- \(n_{\text{obs}}\) is the observed count after PCR amplification.
- \(n_{\text{true}}\) is the true count before PCR amplification.
- \(\epsilon\) is the amplification noise (log-normal or gamma distributed), which is a random variable with \(\mathbb{E}[\epsilon] = 1\) (so, \(\mathbb{E}[n_{\text{obs}}] = n_{\text{true}}\)) and \(\epsilon\) is independent of \(n_{\text{true}}\).
So, the observed count is a product of the true count, \(n_{\text{true}}\), and the amplification noise, \(\epsilon\) (a random variable).

Now, the variance of a constant times a random variable is the square of the constant times the variance of the random variable: \[ \text{Var}(c \cdot X) = c^2 \cdot \text{Var}(X) \]

This follows from the properties of variance: \[ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2 - 2X\mathbb{E}[X] + (\mathbb{E}[X])^2] = \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + (\mathbb{E}[X])^2 = \mathbb{E}[X^2] - 2\mathbb{E}[X]^2 + \mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{E}[X]^2 \]

So,

\[\text{Var}(cX) = \mathbb{E}[(cX)^2] - (\mathbb{E}[cX])^2 = c^2 \mathbb{E}[X^2] - c^2 (\mathbb{E}[X])^2 = c^2[\mathbb{E}[X^2] - (\mathbb{E}[X])^2] = c^2 \,\text{Var}(X)\]

Therefore, applying this to the simple model, we get:

\[ \text{Var}(n_{\text{obs}}) = n_{\text{true}}^2 \cdot \text{Var}(\epsilon) \]

That is why the variance of target-enriched compounds, with higher \(n_{\text{true}}\), scales up more dramatically.

To complement Figure 10, we summarise replicate agreement with a Spearman correlation matrix. Spearman correlation is rank-based, so it asks whether molecules ranked high in one replicate also tend to rank high in another, without assuming a strictly linear relationship.

Code

# collect replicate counts
rep_counts = data[
    [
        "seq_matrix_1", "seq_matrix_2", "seq_matrix_3",
        "seq_target_1", "seq_target_2", "seq_target_3",
    ]
].copy()

# convert counts to within-replicate proportions
for col in rep_counts.columns:
    rep_counts[f"{col}_prop"] = rep_counts[col] / rep_counts[col].sum()

rep_prop_cols = [
    "seq_matrix_1_prop", "seq_matrix_2_prop", "seq_matrix_3_prop",
    "seq_target_1_prop", "seq_target_2_prop", "seq_target_3_prop",
]

# Spearman correlation matrix across replicate proportions
corr_tbl = rep_counts[rep_prop_cols].corr(method="spearman")

# lower-triangle heatmap
mask = np.triu(np.ones_like(corr_tbl, dtype=bool))

plt.figure(figsize=(6, 5))
ax = sns.heatmap(
    corr_tbl,
    mask=mask,
    annot=True,
    fmt=".2f",
    cmap="vlag",
    vmin=-1,
    vmax=1,
    square=True,
    cbar_kws={"shrink": 0.8},
)
ax.set_title("Spearman Correlation Matrix Across Replicates (Proportions)")
plt.tight_layout()
plt.savefig("img/kindel_replicate_heatmap.png")

Figure 11: Spearman correlation matrix across matrix and target replicates using within-replicate proportions.

One more view is useful here: for each molecule, we count in how many of the three replicate experiments the sequencing count is greater than zero. This gives a simple measure of replicate support. A molecule detected in all three replicates has much stronger support than a molecule detected in only one replicate, while a molecule with zero counts in all replicates is a clear example of dropout in that condition. Framing the data this way makes the prevalence of dropout immediately visible and complements the pairwise correlation plots above, which show agreement only among the molecules that are observed.

Code

replicate_support = pd.concat(
    [
        pd.DataFrame({
            "selection": "Matrix",
            "n_nonzero_replicates": (data[["seq_matrix_1", "seq_matrix_2", "seq_matrix_3"]] > 0).sum(axis=1),
        }),
        pd.DataFrame({
            "selection": "Target",
            "n_nonzero_replicates": (data[["seq_target_1", "seq_target_2", "seq_target_3"]] > 0).sum(axis=1),
        }),
    ],
    ignore_index=True,
)

support_summary = (
    replicate_support
    .groupby(["selection", "n_nonzero_replicates"])
    .size()
    .reindex(
        pd.MultiIndex.from_product(
            [["Matrix", "Target"], [0, 1, 2, 3]],
            names=["selection", "n_nonzero_replicates"],
        ),
        fill_value=0,
    )
    .reset_index(name="n_molecules")
)
support_summary["percent"] = (
    100 * support_summary["n_molecules"]
    / support_summary.groupby("selection")["n_molecules"].transform("sum")
)
support_summary["replicate_support_label"] = pd.Categorical(
    support_summary["n_nonzero_replicates"].map({
        0: "Detected in 0 replicates",
        1: "Detected in 1 replicate",
        2: "Detected in 2 replicates",
        3: "Detected in 3 replicates",
    }),
    categories=[
        "Detected in 0 replicates",
        "Detected in 1 replicate",
        "Detected in 2 replicates",
        "Detected in 3 replicates",
    ],
    ordered=True,
)

replicate_support_plot = (
    gg.ggplot(
        support_summary,
        gg.aes(
            x="replicate_support_label",
            y="percent",
            fill="selection",
        ),
    )
    + gg.geom_col(width=0.7, show_legend=False)
    + gg.facet_wrap("~selection", nrow=1)
    + gg.labs(
        x="Detection across replicates",
        y="Percent of molecules",
        title="Replicate support differs strongly between matrix and target selections",
    )
    + gg.theme_bw()
    + gg.theme(
        figure_size=(10, 4),
        axis_text_x=gg.element_text(rotation=20, ha="right"),
    )
)

replicate_support_plot.save("img/kindel_replicate_support.png")

Figure 12: Fraction of molecules observed in 0, 1, 2, or 3 replicates for matrix and target selections.

Together, Figure 10, Figure 11, and Figure 12 support a consistent interpretation:

within-condition replicate correlations are positive and informative (typically stronger for matrix than target), while low-abundance regions remain noisy due to sampling variation. Cross-condition correlations (matrix vs target) are expected to be lower because they reflect different biological/technical mechanisms.

The replicate-support view adds an important nuance: many molecules are absent from all three matrix replicates, whereas target counts are much more often observed in two or three replicates. This reinforces two modelling points:

First, dropout is common and should not automatically be interpreted as a structural zero.
Second, reproducibility depends on condition, which argues for explicit replicate-level noise rather than treating all zeros or disagreements as biologically meaningful.

When fitting count models, this replicate-to-replicate variation should be modelled explicitly rather than ignored.

Relationship between pre and post-selection counts

Pre-selection counts are given by the seq_load column. They represent the abundance of each molecule in the library before any matrix or target selection takes place. That is, they capture how many copies of each barcode were present before exposure to either condition. This pre-selection abundance information helps distinguish true post-selection enrichment from molecules that were simply more abundant in the input library.

Figure 13: Comparing pre- and post-selection counts allows us to account for differences in the original library composition.

Molecules might have higher pre-selection counts due to:

Synthesis variability: Some molecules are overrepresented in the input library due to differences in chemical synthesis.
Sequence preferences: Certain sequences are preferentially amplified during library preparation.

We start by plotting the relationship between pre-selection counts and post-selection counts for the matrix and target selection conditions. Here, we take the mean post-selection count across the three matrix replicates and the three target replicates, then log-transform both the pre-selection counts and the mean post-selection counts to make the plot easier to interpret.

Why use log1p(x) here?

DEL sequencing counts and means are highly skewed and span several orders of magnitude, with many zeros and a few very large values.

log1p(x) makes the plot easier to interpret, because:

log1p(0) = 0 (so zeros remain zeros, avoiding -inf)
log1p(x) compresses large values, reduces skew, and reveals structure across the full dynamic range. For example, a value of 1000 becomes log1p(1000) = 6.9, which is much more manageable and interpretable.
Without log-transform, details at low and medium counts would be obscured by a few outliers (e.g., very large values).

Code

rel = data.copy()
rel["matrix_mean"] = rel[["seq_matrix_1", "seq_matrix_2", "seq_matrix_3"]].mean(axis=1)
rel["target_mean"] = rel[["seq_target_1", "seq_target_2", "seq_target_3"]].mean(axis=1)

rel["log_seq_load"] = np.log1p(rel["seq_load"])
rel["log_matrix_mean"] = np.log1p(rel["matrix_mean"])
rel["log_target_mean"] = np.log1p(rel["target_mean"])


rel_long = pd.concat(
    [
        rel[["log_seq_load", "log_matrix_mean"]].rename(columns={"log_seq_load": "x", "log_matrix_mean": "y"}).assign(panel="Load vs Matrix"),
        rel[["log_seq_load", "log_target_mean"]].rename(columns={"log_seq_load": "x", "log_target_mean": "y"}).assign(panel="Load vs Target"),        
    ],
    ignore_index=True,
)

pre_post_scatter_plot = (
    gg.ggplot(rel_long, gg.aes(x="x", y="y"))
    + gg.geom_point(alpha=0.1, size=0.25)
    + gg.geom_abline(slope=1, intercept=0, color="red")
    + gg.facet_wrap("~panel", scales="free")
    + gg.labs(
        x="log1p (seq_load)", 
        y="log1p (matrix_mean) or log1p (target_mean)", 
        title="Pairwise relationships: seq_load, matrix, and target"
    )
    + gg.theme_bw()
)
pre_post_scatter_plot.save("img/kindel_pre_post_scatter.png", verbose=False)

Figure 14: Pairwise relationships between `seq_load`, matrix mean, and target mean counts on the log scale.

The scatter plots of log-transformed pre-selection counts (seq_load) versus post-selection mean counts (matrix and target) show three consistent patterns:

Dense cloud below the diagonal:
Most molecules lie below the \(y=x\) line, meaning post-selection counts are lower than pre-selection counts. This is expected for non-binders that are depleted during selection. The visible horizontal and vertical banding reflects integer count discreteness after log transformation.
High-count points near the diagonal:
Molecules near the diagonal have post-selection counts close to their initial abundance. Many of these likely reflect input-library overrepresentation (for example, synthesis or sequencing bias as mentioned above) rather than selective enrichment. So, large counts alone should not be interpreted as evidence of binding.
Points above the diagonal (enrichment):
Molecules above \(y=x\) have higher post-selection than pre-selection counts. In “Load vs Matrix,” these are plausible non-specific matrix binders; in “Load vs Target,” they are candidate target binders (e.g., DDR1). A larger above-diagonal population in the target panel suggests enrichment beyond background matrix effects.

What is the modelling implication of the above plots?

Because high post-selection counts can reflect either true enrichment or high initial abundance, the model should adjust explicitly for pre-selection abundance (seq_load).

Figure 14 shows the relationship between pre- and post-selection counts for the matrix and target selection conditions. We can also provide a concise quantitative summary of how strongly these variables move together.

Again, Spearman correlation is a good choice since it is rank-based, asking whether molecules with higher seq_load also tend to have higher matrix or target counts, without assuming a linear relationship. The Spearman correlation matrix therefore complements Figure 14 by turning the visual trends into directly comparable pairwise association measures.

Code

# compute the Spearman correlation matrix
corr_tbl = rel[["log_seq_load", "log_matrix_mean", "log_target_mean"]].corr(method="spearman")
# Plot the Spearman correlation matrix as a lower triangle heatmap

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr_tbl, dtype=bool))

plt.figure(figsize=(5, 4))
ax = sns.heatmap(
    corr_tbl, 
    mask=mask, 
    annot=True, 
    fmt=".2f", 
    cmap="vlag", 
    vmin=-1, 
    vmax=1, 
    square=True,
    cbar_kws={"shrink": .8}
)
plt.title("Spearman Correlation Matrix (log-transformed)\nPre-selection, Matrix, Target")
plt.tight_layout()
plt.savefig("img/kindel_pre_post_corr.png")

Figure 15: Spearman correlation matrix between log-transformed pre-selection abundance (`seq_load`), matrix counts, and target counts.

The correlation matrix shows high correlation between seq_load and matrix counts, and lower correlation between seq_load and target counts. This is consistent with the visual patterns in Figure 14: matrix counts mostly track library composition and have a tighter scatter, whereas target counts reflect enrichment rather than just library composition and therefore have a looser scatter.

Another useful quantity to examine is the correlation between matrix and target counts. High correlation here suggests that target counts may be dominated by background effects.

Relationship between matrix and target counts

Exploring the relationship between matrix and target counts helps us understand whether the target counts are dominated by background effects or are truly enriched by the target protein.

Code

rel_long = pd.concat(
    [
        rel[["log_matrix_mean", "log_target_mean"]].rename(columns={"log_matrix_mean": "x", "log_target_mean": "y"}).assign(panel="Matrix vs Target"),
    ],
    ignore_index=True,
)
matrix_target_scatter_plot = (
    gg.ggplot(rel_long, gg.aes(x="x", y="y"))
    + gg.geom_point(alpha=0.1, size=0.25)
    + gg.geom_abline(slope=1, intercept=0, color="red")
    + gg.labs(x="log1p (matrix_mean)", y="log1p (target_mean)", title="Pairwise relationships: matrix, target")
    + gg.theme_bw()
)
matrix_target_scatter_plot.save("img/kindel_matrix_target_scatter.png", verbose=False)

Figure 17: Relationship between mean matrix and target replicate counts on the log scale.

This plot compares the mean matrix and target counts (across three replicates) for each molecule, with the red line indicating equality (i.e., \(y = x\)). The vertical lines are due to integer-valued counts in the matrix and target selections.

While there is a general positive relationship (0.67 based on the Spearman correlation in Figure 15), reflecting that target counts include matrix-derived background, many points appear above the diagonal. In other words, many molecules have higher target counts than matrix counts. This indicates specific enrichment in the target condition beyond background effects.

Crude enrichment relative to matrix background

Because a molecule can be abundant in the input library and still show non-specific matrix binding, it is useful to combine the previous two ideas into a single diagnostic. A simple way to do that is to plot a crude target-over-matrix enrichment score against pre-selection abundance. This is not a substitute for a probabilistic model, but it provides an intuitive preview of the modelling target.

Code

rel["log2_target_over_matrix"] = np.log2((rel["target_mean"] + 1) / (rel["matrix_mean"] + 1))

enrichment_plot = (
    gg.ggplot(
        rel,
        gg.aes(x="log_seq_load", y="log2_target_over_matrix"),
    )
    + gg.geom_bin_2d(bins=70)
    + gg.geom_hline(yintercept=0, color="#d7301f", linetype="dashed", size=0.7)
    + gg.scale_fill_gradient(
        low="#deebf7",
        high="#08519c",
        name="Number of\nmolecules",
    )
    + gg.labs(
        x="log1p(pre-selection count)",
        y="log2((target mean + 1) / (matrix mean + 1))",
        title="Target-over-matrix enrichment across pre-selection abundance",
    )
    + gg.theme_bw()
    + gg.theme(
        figure_size=(7, 5),
        panel_grid_minor=gg.element_blank(),
        panel_grid_major=gg.element_line(color="#e5e5e5", size=0.3),
        plot_title=gg.element_text(weight="bold"),
    )
)

enrichment_plot.save("img/kindel_target_over_matrix_enrichment_vs_load.png", verbose=False)

Figure 18: Crude target-over-matrix enrichment plotted against pre-selection abundance. The dashed line marks equal target and matrix signal.

This plot brings the previous EDA results together in a single view. The x-axis shows how abundant a molecule was before selection, while the y-axis shows whether its average target count is higher or lower than its average matrix count.

The dashed horizontal line at zero is the key reference point:

Molecules near zero have similar target and matrix counts, so their signal is consistent with background rather than clear target-specific enrichment;
Molecules below zero have lower target than matrix signal, which makes them poor target-binder candidates;
Molecules above zero have excess target signal relative to matrix background, making them more plausible target-specific binders.

Most importantly, molecules with positive enrichment appear across a range of pre-selection abundances, not only among the most abundant compounds in the input library. The quantity we care about is not raw count size by itself, but target-specific enrichment after accounting for both matrix background and input abundance.

Relationship between mean and variance of the sequencing counts

The mean–variance relationship is a useful diagnostic for over/under-dispersion.

Overdispersion: The variance is greater than the mean.
Underdispersion: The variance is less than the mean.

In DEL data, underdispersion is often rare and overdispersion is often the norm.

The following factors contribute to inflating the variance above the mean:

PCR amplification
Molecule-specific amplification biases
Replicate-to-replicate technical variability.

True underdispersion is uncommon; when variance appears below the mean, it is often due to small-sample noise (e.g., with only a few replicates) rather than a genuinely underdispersed process.

As seen in Figure 3, the sequencing count distributions are right-skewed and heavy-tailed. That suggests large variability, but it does not by itself establish overdispersion in the strict statistical sense. To assess that directly, we compare the sample mean and variance across replicates for the matrix and target conditions.

Code

# Compute mean and variance for both matrix and target replicates
mv = pd.DataFrame({
    'mean_matrix': data[['seq_matrix_1', 'seq_matrix_2', 'seq_matrix_3']].mean(axis=1),
    'var_matrix':  data[['seq_matrix_1', 'seq_matrix_2', 'seq_matrix_3']].var(axis=1),
    'mean_target': data[['seq_target_1', 'seq_target_2', 'seq_target_3']].mean(axis=1),
    'var_target':  data[['seq_target_1', 'seq_target_2', 'seq_target_3']].var(axis=1),
})

# Melt for long format for plotting
mv_long = pd.concat([
    mv[['mean_matrix', 'var_matrix']].rename(columns={'mean_matrix': 'mean', 'var_matrix': 'var'}).assign(condition='Matrix'),
    mv[['mean_target', 'var_target']].rename(columns={'mean_target': 'mean', 'var_target': 'var'}).assign(condition='Target'),
], ignore_index=True)

mean_var_plot = (
    gg.ggplot(mv_long, gg.aes(x='mean', y='var'))
    + gg.geom_point(size=0.3, alpha=0.3)
    + gg.geom_abline(slope=1, intercept=0, color='red', linetype='dashed')
    + gg.scale_x_log10() + gg.scale_y_log10()
    + gg.facet_wrap('~condition', scales='free')
    + gg.labs(
        x='Mean count',
        y='Variance',
        title='Mean–Variance by Condition: points above red line indicate overdispersion'
    )
    + gg.theme_bw()
)

mean_var_plot.save("img/kindel_mean_variance.png", verbose=False)

Figure 19: Mean-variance relationship of the sequencing counts for the matrix and target selection conditions.

The red line marks the Poisson benchmark, where the variance equals the mean:

\[ \mathrm{Var}(X) = \mathrm{E}[X] \]

Points above the red line have variance greater than the mean:

\[ \mathrm{Var}(X) > \mathrm{E}[X] \]

Both matrix and target conditions show clear overdispersion: most points lie above the red Poisson line across the count range.

In the matrix condition, the extra spread likely reflects technical noise (PCR, sequencing) together with non-specific matrix binding. Molecules with similar mean counts can still vary substantially across replicates.

In the target condition, the spread includes the same technical sources plus biological variability from true target binding. Enriched binders can therefore have both high mean and high variance, while non-binders remain concentrated in the low-count region.

In the matrix condition, the points with high mean and low variance on the right side of the plot likely correspond to the compounds present in all three replicates, as discussed in the Matrix binders section.

The horizontal lines in the plots are due to integer-valued counts in the matrix and target selections, as mentioned in the previous section.

What do points below the red line indicate?

Points below the red line do not necessarily imply true underdispersion, where the variance is less than the mean:

\[ \mathrm{Var}(X) < \mathrm{E}[X] \]

Two main reasons for points below the red line:

Small-sample variance noise: With only three replicates, we can get strong estimation noise. Even if the true process is Poisson or NB, some sample variances will fall below the mean purely by chance.
Nearly identical replicates: If the replicates are nearly identical, the variance will be very low.

Example: Suppose we have three replicates: (10,10,11). The mean is 10.33 and the variance is 0.33. The variance is tiny, so the point falls below the red line. In this case, the low variance reflects limited observed replicate variation and does not necessarily indicate biologically meaningful underdispersion.

Which likelihoods are most appropriate here?

We choose likelihood functions based on two EDA findings:

Rare-event, high-depth counts:
As shown in Strong low-count mass across all conditions, most molecules have very low counts. This sparse regime is well described by independent-barcode (rate-based) models such as Poisson, Negative Binomial, and Poisson-Lognormal.

In this setting, fixed-depth (composition-based) models become approximately equivalent to rate-based models, which further supports focusing on the independent-barcode view. (See Two data-generation perspectives for DEL counts, Multinomial / Dirichlet-Multinomial, and Binomial / Beta-Binomial.)
Overdispersion:
In Relationship between mean and variance of the sequencing counts, the variance is consistently larger than the mean, indicating overdispersion. This rules out a pure Poisson assumption and favors likelihoods that explicitly model extra-Poisson variability.

Taken together, the data support Negative Binomial and Poisson-Lognormal as the most appropriate likelihoods for DEL rate-based count modelling in this analysis.

Considerations for modelling parameters of the likelihood functions

The EDA highlights three non-target or technical effects that should be modelled explicitly because they can distort post-selection counts.

Matrix-driven background: As discussed in Matrix binders, some compounds are consistently enriched in the matrix condition, and this pattern is reproducible across replicates (see Correlations between replicates). In addition, Relationship between matrix and target counts shows that matrix and target counts are correlated. Together, these results indicate that part of the target signal can reflect non-specific matrix binding rather than true target affinity.
Library composition effects: Relationship between pre and post-selection counts shows that pre-selection abundance is correlated with post-selection counts. This implies that baseline library representation can propagate into both matrix and target readouts, independently of binding strength.
Replicate-level variation: The Correlations between replicates analysis (including the Spearman summary in Figure 11) shows that replicate agreement is substantial but not perfect, especially in low-abundance regions where sampling and technical noise are strongest. This indicates that replicate-to-replicate shifts should be modelled explicitly, for example with replicate random effects or condition-specific replicate offsets.

In practice, likelihood parameterisation should separate background structure (matrix effects and input composition), replicate-level technical variation, and true target-dependent enrichment. Doing so improves both the accuracy of enrichment estimates and the biological interpretability of model outputs.

What exactly is the modelling target?

At this stage, it is useful to distinguish clearly between the data we observe, the latent quantity we want to infer, and the practical goal of the model.

Observed data: the raw sequencing counts themselves, such as pre-selection (seq_load), matrix counts, and target counts across replicates. These are the quantities modelled directly by the likelihood.
Inferential target: the latent enrichment or binding-related signal that explains why some molecules have higher target counts than would be expected from input abundance and background matrix effects alone.
Practical goal: not merely predicting raw counts, but denoising and ranking molecules so that likely binders are separated from molecules that look strong only because of technical noise, input abundance, or non-specific background binding, while still allowing genuinely promising low-count compounds to be rescued when the overall evidence supports binding.

This distinction matters because a good probabilistic DEL model should treat observed counts as noisy evidence, not as the binding signal itself.

A causal sketch of DEL counts

Taken together, the EDA results suggest that observed counts are not direct measurements of binding. They are the combined result of several overlapping causes, only one of which is the target-binding signal we care about.

A causal sketch makes this explicit by mapping which latent factors feed into the observed counts.

Drawing on everything identified in the EDA, the four latent causes of DEL sequencing counts are:

Library composition is a common upstream cause of all three count types: pre-selection, matrix, and target. A molecule that happens to be over-represented in the synthesised library will appear more often in every condition, regardless of whether it binds anything. This was established in Relationship between pre and post-selection counts.
Matrix background is a shared cause of both matrix and target counts. Non-specific interactions with the solid support inflate post-selection counts in both conditions simultaneously, which is why the Relationship between matrix and target counts shows a positive association even for non-binders.
Target binding is the only cause that flows exclusively into the target counts, on top of what matrix background and library composition already contribute. This is the signal of interest.
Replicate effects and technical noise operate at the observation stage. Because we have replicate matrix and target measurements but not replicate pre-selection counts, this noise enters the model at the point where latent rates become observed post-selection read counts.

Figure 20 summarises this structure in two layers. On the left are the latent causes: library composition or input abundance, matrix-background propensity, target-binding propensity, and condition-specific replicate effects (sampling, PCR, and run variation). These feed into the right layer contains the observed sequencing counts: seq_load for pre-selection, together with replicated matrix (seq_matrix_1, seq_matrix_2, seq_matrix_3) and target (seq_target_1, seq_target_2, seq_target_3) read counts.

Figure 20: Causal sketch of DEL sequencing counts.

The causal sketch also makes it clear why a simple enrichment ratio, for example, dividing target counts by matrix counts, is not a satisfying answer to the DEL hit identification problem. Such a ratio partially cancels the matrix background, but it ignores library composition entirely, so a molecule that is abundant in the input library will still look enriched even if it does not bind the target.

Figure 20 is the blueprint for the modelling approach in Part 2. Each latent cause is translated into a specific component of a probabilistic count model so that true binding signal can be separated from background and technical noise.

Key Takeaways and Next Steps

In this post, we focused on the statistical principles needed to model DEL sequencing counts. The exploratory analyses showed that the data are sparse, dominated by low counts, and clearly overdispersed, which makes rate-based count models such as the Negative Binomial and Poisson-Lognormal more appropriate than simple Poisson models. At the same time, the EDA highlighted three effects that any useful DEL model should handle explicitly: matrix-driven background, library composition, and replicate-level technical variation. These observations motivate both the likelihood choice and the structure of the model introduced in Part 2.

These observations have an important modelling implication: observed post-selection counts should not be interpreted as direct evidence of binding without accounting for background and technical structure. A good DEL model should explain noisy counts through a probabilistic likelihood while separating true enrichment signal from non-specific matrix effects, input abundance, and replicate-to-replicate variability.

In the next post, we will build on this principle by specifying and fitting a count-based probabilistic denoising model for DEL data. The goal will be to model observed counts directly while inferring latent enrichment or binding-related signal in a way that is more robust to technical and compositional artefacts.