Avid Afzal

On this page

  • What Is the Aim of This Blog?
  • Introduction: What Are Variational Autoencoders (VAEs)?
  • Intuition: What a VAE Is Really Learning
  • Assumptions: What Are the Core Assumptions of a VAE Model?
  • Architecture: What Is the Role of VAE’s Encoder and Decoder?
  • Training Objective: How Can We Learn \(\color{green}{\phi}\) and \(\color{blue}{\theta}\) Jointly?
    • Measuring Closeness: The KL Divergence
  • ELBO Derivation: Deriving the Evidence Lower Bound
    • Step 1: Start with the KL Divergence
    • Step 2: Apply Bayes’ Rule to the Posterior
    • Step 3: Split the Logarithm
    • Step 4: Separate the Integral
    • Step 5: Decompose the Joint Distribution
    • Step 6: Simplify Using Logarithm Properties
    • Step 7: Split Into Two Integrals
    • Step 8: Recognize Standard Forms
    • Step 9: Rearrange to Isolate the Evidence (That Is \(p_\theta(x)\))
    • Step 10: The VAE Loss Function
    • Parameter Names: What Are the Parameters of the VAE?
    • Why This Solves Our Problem: What Is the Purpose of the VAE Loss?
  • Limitations: What Are the Key Limitations of VAEs?
    • 1. Blurry Reconstructions
    • 2. Posterior Collapse
    • 3. Limited Expressiveness
    • 4. Inability to Directly Measure True Likelihood
  • Conclusion: Key Takeaways and Final Remarks

Mathematical Derivation of VAE

Update:

This post was updated in March 2026. Notably, the section on Limitations: What Are the Key Limitations of VAEs? has been revised to include a clearer taxonomy of VAE variants. In addition, I have added a new subsection on Jacobian-Regularized VAEs (“Jacobian VAE”), reflecting recent discussions with a colleague on this topic.

What Is the Aim of This Blog?

This blog draws upon insights from this article and this post, with the objective of developing a better understanding of Variational Autoencoders (VAEs) for my own research.

Introduction: What Are Variational Autoencoders (VAEs)?

Variational Autoencoders (VAEs) are a foundational class of latent variable models that combine probabilistic modeling with neural networks, enabling both representation learning and data generation. Unlike standard autoencoders, VAEs provide a principled probabilistic framework that allows us to reason about uncertainty, impose structure on latent spaces, and perform meaningful sampling.

At a high level, VAEs assume that high-dimensional observations are generated from a lower-dimensional latent variable through a stochastic process. Learning such models, however, presents a central challenge: the true posterior distribution over latent variables is intractable. Variational inference resolves this by introducing a tractable approximation and reframing learning as an optimization problem.

The goal of this post is to develop a precise and fully mathematical understanding of how VAEs are derived, from their generative assumptions to the Evidence Lower Bound (ELBO) that is optimized in practice. Rather than focusing on intuition alone, I walk through each step of the derivation, clarifying where approximations are introduced and why the final objective is both computable and effective.

Intuition: What a VAE Is Really Learning

A VAE assumes that each data point can be explained by a small set of hidden factors (latent variables) that capture the essential structure of the data.

The encoder does not map an input to a single point in latent space, but instead learns a distribution over plausible latent representations, reflecting uncertainty about how the data was generated.

The decoder then learns how to probabilistically reconstruct the data from samples drawn from this latent distribution.

Training a VAE is therefore a balancing act: we want latent representations that are expressive enough to reconstruct the data well, while also being regularized to follow a simple prior distribution so that the latent space remains smooth and generative.

Assumptions: What Are the Core Assumptions of a VAE Model?

Assume we have a dataset \(X\):

\[X = [\vec{x}^{(i)}]_{i=1}^N = \{\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(N)}\}\]

The samples \(\vec{x}^{(i)}\) are assumed to be IID, and each can be continuous or discrete-valued.

The VAE framework makes the following fundamental assumptions about how the observed data is generated:

Assumption 1: Latent Prior Distribution

Each latent vector \(\vec{z}^{(i)}\) is drawn from a prior distribution:

\[\vec{z}^{(i)} \sim \color{purple}{p_{\theta^*}(\vec{z})}\]

where \(\color{purple}{p_{\theta^*}(\vec{z})}\) is the \(\color{purple}{\text{prior}}\) over the lower-dimensional latent space, parameterized by \(\color{purple}{\theta^*}\).

Assumption 2: Conditional Likelihood

Each observed data point is generated from its corresponding latent vector through a conditional distribution:

\[\vec{x}^{(i)} \sim \color{blue}{p_{\theta^*}(\vec{x} \mid \vec{z} = \vec{z}^{(i)})}\]

where \(\color{blue}{p_{\theta^*}(\vec{x} \mid \vec{z})}\) is the model’s \(\color{blue}{\text{likelihood}}\) function, representing the probability of generating \(\vec{x}\) given latent code \(\vec{z}\).

In essence, we assume the observed high-dimensional dataset \(X\) is generated by a latent variable model with an underlying lower-dimensional random process \(\vec{z}\).

The goal is to find the parameter \(\color{purple}{\theta^*}\) that makes our observed data as likely as possible under the model. In other words, we want to maximize the likelihood of the data:

\[\color{purple}{\theta^*} {\color{black}{= \arg \max_{\theta}\prod_{i=1}^N}} \color{teal}{p_\theta(\vec{x}^{(i)})}\]

However, it is usually more convenient and numerically stable to work with the log-likelihood (since logs turn products into sums). Therefore, we rewrite the objective as:

\[\color{purple}{\theta^*} \color{black}{ = \arg \max_{\theta}\sum_{i=1}^N \log} \; \color{black}{\left( \color{teal}{p_\theta(\vec{x}^{(i)})} \color{black} \right)}\]
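As a quick numerical illustration of why logs matter (the per-sample probability values below are made up), multiplying many small likelihoods underflows to zero in floating point, while summing their logs stays finite:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-sample likelihoods p_theta(x^(i)), each a small number
probs = rng.uniform(1e-4, 1e-2, size=1000)

print(np.prod(probs))         # underflows to 0.0 in float64
print(np.sum(np.log(probs)))  # the log-likelihood stays finite
```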

To compute \(\color{teal}{p_\theta(\vec{x}^{(i)})}\), we marginalize over the latent variable \(z\):

\[\color{teal}{p_\theta(\vec{x}^{(i)})} \color{black}{=\int} \color{blue}{p_\theta(x|z)}\, \color{purple}{p_\theta(z)}\, \color{black}{dz}\]

However, directly computing this integral is typically intractable because it requires evaluating \(\color{blue}{p_\theta(x|z)}\) for all possible values of \(z\). To make this computation practical, we introduce an auxiliary function, \(\color{green}{q_\phi(z|x)}\), called the variational distribution or approximate posterior. This function, parameterized by \(\color{green}{\phi}\), provides a tractable way to estimate which values of \(z\) are likely given a particular input \(x\).
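To see the intractability concretely, consider a toy 1-D model (all numbers here are made up): \(z \sim \mathcal{N}(0,1)\) and \(x|z \sim \mathcal{N}(2z, 0.1^2)\). A naive Monte Carlo estimate of \(p_\theta(x) = \mathbb{E}_{z \sim p(z)}[p_\theta(x|z)]\) using prior samples is dominated by the rare \(z\) values that actually explain \(x\), so it converges very slowly; this is exactly the gap that \(\color{green}{q_\phi(z|x)}\) is meant to close:

```python
import numpy as np

rng = np.random.default_rng(0)

def likelihood(x, z, sigma=0.1):
    """p(x|z) for the toy model x|z ~ N(2z, sigma^2)."""
    return np.exp(-0.5 * ((x - 2 * z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_obs = 1.0  # only z near 0.5 gives this x any appreciable likelihood

# Naive Monte Carlo: p(x) ~= mean of p(x|z) over prior samples z ~ N(0,1).
# Most samples contribute ~0, so the estimate is very noisy for small n.
for n in [100, 10_000, 1_000_000]:
    z = rng.standard_normal(n)
    print(n, likelihood(x_obs, z).mean())
```

In this toy model the marginal is available exactly, \(p_\theta(x) = \mathcal{N}(x; 0, 4 + \sigma^2) \approx 0.176\) at \(x = 1\); the small-\(n\) estimates scatter widely around that value.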

Architecture: What Is the Role of VAE’s Encoder and Decoder?

To understand the VAE architecture, let’s start with Bayes’ theorem, which relates our key distributions:

\[\color{red}{p_\theta(z|x)} \color{black}{= \frac{\color{purple}{p_\theta(z)} \color{black}{ \times} \color{blue}{p_\theta(x|z)}}{\color{teal}{p_\theta(x)}}}\]

This equation shows that the posterior distribution \(\color{red}{p_\theta(z|x)}\) (the probability of latent code \(z\) given observation \(x\)) can be computed from the prior \(\color{purple}{p_\theta(z)}\), the likelihood \(\color{blue}{p_\theta(x|z)}\), and the evidence \(\color{teal}{p_\theta(x)}\).

The Challenge: Intractable Posterior

The posterior \(\color{red}{p_\theta(z|x)}\) is intractable to compute directly because it requires knowing \(\color{teal}{p_\theta(x)}\), which involves the difficult integral we discussed earlier. This is where the VAE’s two-part architecture comes in:

The Encoder (Recognition Network)

The encoder approximates the intractable posterior \(\color{red}{p_\theta(z|x)}\) using variational inference.

  • We introduce a variational distribution \(\color{green}{q_\phi(z|x)}\) that is designed to be tractable
  • The encoder learns parameters \(\color{green}{\phi}\) to make \(\color{green}{q_\phi(z|x)} \color{black}{\approx} \color{red}{p_\theta(z|x)}\) as close as possible
  • Role: Given an input \(x\), the encoder outputs a distribution over likely latent codes \(z\)

The Decoder (Generative Network)

The decoder models the likelihood \(\color{blue}{p_\theta(x|z)}\), which is the generative part of the model.

  • The decoder learns parameters \(\color{blue}{\theta}\) to map from the latent space back to the data space
  • Role: Given a latent code \(z\), the decoder outputs a distribution over possible reconstructions \(x\)

In summary: the encoder compresses observations \(x\) into latent representations \(z\), while the decoder reconstructs observations from latent codes.
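A minimal PyTorch sketch of this two-part architecture (the layer sizes and MNIST-style dimensions are illustrative assumptions, not prescribed by the derivation):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x to the parameters (mean, log-variance) of q_phi(z|x)."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Maps a latent code z to the parameters of p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)

x = torch.randn(8, 784)
mu, logvar = Encoder()(x)                             # distribution over z, not a point
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # one sample from q_phi(z|x)
x_recon = Decoder()(z)                                # reconstruction parameters
```

Note that the encoder outputs the parameters of a distribution over \(z\), from which we sample, rather than a single deterministic code.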

Training Objective: How Can We Learn \(\color{green}{\phi}\) and \(\color{blue}{\theta}\) Jointly?

Now we have two sets of parameters to optimize:

  • Encoder parameters \(\color{green}{\phi}\): control the approximate posterior \(\color{green}{q_\phi(z|x)}\)
  • Decoder parameters \(\color{blue}{\theta}\): control the likelihood \(\color{blue}{p_\theta(x|z)}\) (and also appear in the true posterior \(\color{red}{p_\theta(z|x)}\))

Our goal is to make the estimated posterior \(\color{green}{q_\phi(z|x)}\) as close as possible to the true (but intractable) posterior \(\color{red}{p_\theta(z|x)}\).

Measuring Closeness: The KL Divergence

To measure how “close” two probability distributions are, we use the Kullback-Leibler (KL) divergence. Specifically, we use the reverse KL divergence:

\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log \frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right]}\]

which can be written more explicitly as:

  • For discrete latent variables: \[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \sum_{z \in Z} \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right)}\]

  • For continuous latent variables: \[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)}\color{black} \Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right) \color{black}{dz}}\]
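As a concrete (made-up) discrete example, the KL divergence is just a weighted sum of log-ratios:

```python
import numpy as np

# A toy 3-state latent space: q is peaked, p is uniform
q = np.array([0.7, 0.2, 0.1])
p = np.array([1/3, 1/3, 1/3])

kl_qp = np.sum(q * np.log(q / p))  # KL(q || p), in nats
print(kl_qp)
```

This comes out to about 0.30 nats; it would be exactly zero if \(q\) equaled \(p\).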

Key Properties of KL Divergence

Understanding these properties is helpful for grasping how VAE training works:

1. Non-negativity: \(\text{KL}(q \; || \; p) \geq 0\) for any two distributions \(q\) and \(p\)

  • The KL divergence is always non-negative or zero
  • This property comes from Jensen’s inequality
  • Intuitively: there’s always some “cost” to using an approximation unless it’s exact

2. Zero if and only if distributions are identical: \(\text{KL}(q \; || \; p) = 0 \iff q = p\) (almost everywhere)

  • When \(q\) and \(p\) are exactly the same, there’s no information loss
  • This is our ideal case: \(\color{green}{q_\phi(z|x)} \color{black}{=} \color{red}{p_\theta(z|x)}\)
  • In practice, we get close but rarely achieve exactly zero

3. Asymmetric (not a distance metric): \(\text{KL}(q \; || \; p) \neq \text{KL}(p \; || \; q)\) in general

  • The order matters! \(\text{KL}\color{black}(\color{green}{q_\phi} \; \color{black}|| \; \color{red}{p_\theta}\color{black})\) is called the reverse KL
  • Reverse KL encourages \(q\) to focus on regions where \(p\) has high probability
    • This results in \(q\) spreading out to cover multiple modes
  • This is why VAE tends to produce “mode-covering” behavior rather than “mode-seeking”
  • Check this blog, where Eric Jang has a great explanation for forward and reverse KL divergence.

4. Measures information loss: It quantifies the expected extra bits (or nats) needed when using \(q\) to encode samples from \(p\)

  • In VAE context: how much information we lose by using \(\color{green}{q_\phi(z|x)}\) instead of the true \(\color{red}{p_\theta(z|x)}\)
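Properties 1 and 3 are easy to check numerically using the closed-form KL between two univariate Gaussians (the means and variances below are arbitrary):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

fwd = kl_gauss(0.0, 1.0, 1.0, 0.5)  # KL(q || p)
rev = kl_gauss(1.0, 0.5, 0.0, 1.0)  # KL(p || q)
print(fwd, rev)  # both non-negative, but clearly unequal
```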

By minimizing \(\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)}\color{black} \Biggr)\), we make our encoder’s approximate posterior as accurate as possible. However, there’s a problem: we can’t directly compute this KL divergence because it involves the intractable posterior \(\color{red}{p_\theta(z|x)}\)!

This is where the ELBO (Evidence Lower Bound) comes in, which we’ll derive next.

ELBO Derivation: Deriving the Evidence Lower Bound

Recall that we want to minimize \(\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr)\), but we can’t compute it directly because it involves the intractable posterior \(\color{red}{p_\theta(z|x)}\).

The variational inference idea is that we can derive an alternative objective that:

  1. Is tractable to compute
  2. Still allows us to optimize both \(\color{green}{\phi}\) and \(\color{blue}{\theta}\)
  3. Automatically handles the intractability

Let’s derive this alternative objective step by step.

Step 1: Start with the KL Divergence

For continuous latent variables, the KL divergence is defined as:

\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right) \color{black}{dz}}\]

Step 2: Apply Bayes’ Rule to the Posterior

Using Bayes’ rule, we know that:

\[\color{red}{p_\theta(z|x)} \color{black}{= \frac{p_\theta(x,z)}{\color{teal}{p_\theta(x)}}}\]

Substituting this into our KL divergence:

\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)} \cdot \color{teal}{p_\theta(x)}}{p_\theta(x,z)} \right) \color{black}{dz}}\]

Step 3: Split the Logarithm

Using the property \(\log(ab/c) = \log(a) + \log(b/c)\), where \(a = \color{teal}{p_\theta(x)}\), \(b = \color{green}{q_\phi(z|x)}\), and \(c = p_\theta(x,z)\):

\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\left[ \log(\color{teal}{p_\theta(x)} \color{black}) + \log \left( \frac{\color{green}{q_\phi(z|x)}}{p_\theta(x,z)} \right) \right]} \color{black}{dz}}\]

Step 4: Separate the Integral

Split into two integrals:

\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)}\color{black} \Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black} \log(\color{teal}{p_\theta(x)} \color{black} ) \, dz + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{p_\theta(x,z)} \right) dz}\]

The Evidence is Constant

The first integral simplifies because \(\color{teal}{p_\theta(x)}\) doesn’t depend on \(z\):

\[\int \color{green}{q_\phi(z|x)} \color{black} \log(\color{teal}{p_\theta(x)} \color{black}) \color{black} \, dz = \log(\color{teal}{p_\theta(x)} \color{black}) \int \color{green}{q_\phi(z|x)} \color{black} \, dz = \log(\color{teal}{p_\theta(x)} \color{black})\]

since \(\int \color{green}{q_\phi(z|x)} \color{black} \, dz = 1\) (probability distributions must integrate to 1).

So now we have:

\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{p_\theta(x,z)} \right) \color{black}{dz}}\]

Step 5: Decompose the Joint Distribution

The joint distribution can be factored as:

\[p_\theta(x,z) = \color{blue}{p_\theta(x|z)} \cdot \color{purple}{p_\theta(z)}\]

Substituting this:

\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{blue}{p_\theta(x|z)} \cdot \color{purple}{p_\theta(z)}} \right) \color{black}{dz}}\]

Step 6: Simplify Using Logarithm Properties

Using \(\log(a/(bc)) = \log(a/b) - \log(c)\), where \(a = \color{green}{q_\phi(z|x)}\), \(b = \color{purple}{p_\theta(z)}\), and \(c = \color{blue}{p_\theta(x|z)}\):

\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \left[ \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{purple}{p_\theta(z)}} \right) - \log(\color{blue}{p_\theta(x|z)} \color{black}) \right]{dz}}\]

Step 7: Split Into Two Integrals

\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{purple}{p_\theta(z)}} \right) \color{black}{dz} - \int \color{green}{q_\phi(z|x)} \color{black} \log(\color{blue}{p_\theta(x|z)} \color{black}) \color{black}{dz}}\]

Step 8: Recognize Standard Forms

The second term is the KL divergence between \(\color{green}{q_\phi(z|x)}\) and the prior \(\color{purple}{p_\theta(z)}\):

\[\int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{purple}{p_\theta(z)}} \right) \color{black}{dz} = \text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \Biggr)\]

The third term is an expectation under \(\color{green}{q_\phi(z|x)}\):

\[\int \color{green}{q_\phi(z|x)} \color{black} \log(\color{blue}{p_\theta(x|z)} \color{black}) \color{black} \, dz = \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)}\color{black}) \right]\]

Therefore:

\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) = \log(\color{teal}{p_\theta(x)} \color{black}) + \text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \Biggr) - \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right]\]

Step 9: Rearrange to Isolate the Evidence (That Is \(p_\theta(x)\))

Rearranging the equation:

\[\log(\color{teal}{p_\theta(x)} \color{black}) = \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\biggr) + \underbrace{\mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right] - \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \biggr)}_{\text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}; x)}\]

The Evidence Lower Bound (ELBO)

We define the Evidence Lower Bound (ELBO) as:

\[\text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black}) = \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right] - \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \biggr)\]

Why is it called a “lower bound”?

From the equation in step 9: \[\log(\color{teal}{p_\theta(x)} \color{black}) = \underbrace{\text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\biggr)}_{\geq 0} + \text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black})\]

Rearranging to isolate the ELBO: \[\text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black}) = \log(\color{teal}{p_\theta(x)} \color{black}) - \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\biggr)\]

Since \(\text{KL} \geq 0\) (always non-negative), we’re subtracting a non-negative value from the log-evidence:

\[\text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black}) = \log(\color{teal}{p_\theta(x)} \color{black}) - \underbrace{\text{KL}}_{\geq 0} \leq \log(\color{teal}{p_\theta(x)} \color{black})\]

Therefore, the ELBO is always less than or equal to the log-evidence, \(\log(\color{teal}{p_\theta(x)}\color{black} )\)! It provides a lower bound! The gap between them is exactly the KL divergence between our approximate and true posteriors.
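This identity can be checked numerically in a toy conjugate-Gaussian model where everything is available in closed form (the observation \(x\) and the deliberately imperfect \(q\) below are arbitrary choices):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Toy conjugate model: prior z ~ N(0, 1), likelihood x|z ~ N(z, 1).
# Then the evidence is p(x) = N(x; 0, 2) and the true posterior is N(x/2, 1/2).
x = 1.3
log_evidence = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4

# A deliberately imperfect approximate posterior q = N(m, s^2)
m, s = 0.2, 0.8

# E_q[log p(x|z)] has a closed form since E_q[(x - z)^2] = (x - m)^2 + s^2
expected_loglik = -0.5 * np.log(2 * np.pi) - ((x - m)**2 + s**2) / 2
elbo = expected_loglik - kl_gauss(m, s, 0.0, 1.0)
gap = kl_gauss(m, s, x / 2, np.sqrt(0.5))

print(elbo + gap, log_evidence)  # identical: log p(x) = ELBO + KL(q || p(z|x))
print(elbo <= log_evidence)      # True: the ELBO is a lower bound
```

Whatever \(q\) you plug in, \(\text{ELBO} + \text{KL}(q \,||\, p(z|x))\) reproduces the log-evidence exactly, and the ELBO alone sits below it.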

Step 10: The VAE Loss Function

In practice, we want to minimize a loss function. The VAE loss is simply the negative ELBO:

\[\mathcal{L}_{\text{VAE}}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black}) = -\text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}\color{black}; x )\]

\[\boxed{\mathcal{L}_{\text{VAE}}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black}) = \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \biggr) - \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right]}\]

We seek the optimal parameters:

\[\color{green}{\phi}^*, \color{blue}{\theta}^* \color{black}{= \arg\min_{\color{green}{\phi}, \color{blue}{\theta}} \mathcal{L}_{\text{VAE}}(\color{blue}{\theta}, \color{green}{\phi} \color{black}; x )}\]

Interpreting the VAE Loss

The VAE loss has two terms:

  1. KL Regularization Term: \(\text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \biggr)\)
    • Encourages the encoder’s approximate posterior to stay close to the prior
    • Acts as a regularizer preventing overfitting
    • Ensures the latent space has a nice structure
  2. Reconstruction Term: \(-\mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right]\)
    • Encourages the decoder to reconstruct the input accurately
    • When the decoder outputs Gaussian distributions, this becomes MSE (Mean Squared Error)
    • When the decoder outputs Bernoulli distributions, this becomes BCE (Binary Cross-Entropy)

It enforces the following properties:

  • Structured latent geometry: The KL term shapes latent space into a smooth, continuous manifold aligned with the prior.
  • Smooth interpolation: Nearby latent points decode to semantically similar outputs, enabling meaningful interpolations.
  • Valid sampling: Samples drawn from the prior fall in regions the decoder understands, producing realistic generations.
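The two loss terms can be written down concretely for one common instantiation: a standard normal prior, a diagonal-Gaussian encoder, and a Bernoulli decoder (these are modeling choices I am assuming here, not requirements of the derivation):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, x, z_mu, z_logvar):
    """Negative ELBO for one batch.

    recon and x must lie in [0, 1] (Bernoulli decoder -> BCE reconstruction).
    The KL term is the closed form of KL( N(mu, sigma^2 I) || N(0, I) ).
    """
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + z_logvar - z_mu.pow(2) - z_logvar.exp())
    return recon_loss + kl
```

With a Gaussian decoder you would swap the BCE term for a (scaled) MSE term, as discussed in the limitations section below.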

Parameter Names: What Are the Parameters of the VAE?

  • \(\color{green}{\phi}\) are the variational parameters (encoder parameters)
  • \(\color{blue}{\theta}\) are the generative parameters (decoder parameters)

Why This Solves Our Problem: What Is the Purpose of the VAE Loss?

Maximizing the ELBO (equivalently, minimizing \(\mathcal{L}_{\text{VAE}}\)) simultaneously:

  1. Increases \(\log(\color{teal}{p_\theta(x)} \color{black})\) (generates more realistic samples)
  2. Decreases \(\text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\biggr)\) (better posterior approximation)

And we can compute everything in the ELBO without ever needing the intractable \(\color{red}{p_\theta(z|x)}\)!

Limitations: What Are the Key Limitations of VAEs?

While VAEs provide a principled and elegant framework for learning latent variable models, they have several limitations:

1. Blurry Reconstructions

One of the most noticeable issues with VAEs is their tendency to produce blurry reconstructions, particularly for images. When the decoder models the likelihood as a Gaussian distribution with a fixed variance, the VAE loss penalizes deviations from the mean prediction. As a result, when there are multiple plausible outputs for a given input, the model is incentivized to output their average; this produces a blurry result rather than choosing one sharp possibility.

Why does this happen?

Suppose the decoder models the likelihood as a Gaussian distribution with mean \(\mu_\theta(z)\) (the decoder’s output) and fixed variance \(\sigma^2\): \[p_\theta(x|z) = \mathcal{N}(x; \mu_\theta(z), \sigma^2 I)\]

The probability density function is: \[ p_\theta(x|z) = \frac{1}{(2\pi \sigma^2)^{D/2}} \exp\left(-\frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2\right) \] where \(D\) is the data dimensionality.

Taking the log of both sides: \[ \log p_\theta(x|z) = \log\left[\frac{1}{(2\pi \sigma^2)^{D/2}}\right] + \log\left[\exp\left(-\frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2\right)\right] \]

Now simplify each term separately:

Starting with the first term, the normalization constant: \[ \log\left[\frac{1}{(2\pi \sigma^2)^{D/2}}\right] = \log(1) - \log\left[(2\pi \sigma^2)^{D/2}\right] = 0 - \frac{D}{2}\log(2\pi\sigma^2) = -\frac{D}{2}\log(2\pi\sigma^2) \] where we used: \(\log(1/a) = \log(1) - \log(a) = -\log(a)\) and \(\log(a^b) = b\log(a)\)

Then, the second term, the exponential: \[ \log\left[\exp\left(-\frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2\right)\right] = -\frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2 \] where we used: \(\log(\exp(a)) = a\)

Finally, combining both terms: \[ \log p_\theta(x|z) = -\frac{D}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2 \]

Next, identify the constant term:

The first term, \(-\frac{D}{2}\log(2\pi\sigma^2)\), is constant with respect to the model parameters \(\theta\) (as long as \(\sigma^2\) is fixed). This means it has no impact on optimization, so when training the model, we can safely ignore or drop this constant term. Up to that additive constant: \[ \log p_\theta(x|z) = - \frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2 + \text{const} \]

Connect to the reconstruction loss:

Recall that the reconstruction term in the VAE loss is the negative expected log-likelihood, that is, \[ \text{Reconstruction Loss} = -\mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] \]

Substituting our Gaussian likelihood: \[ \text{Reconstruction Loss} = \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \frac{1}{2\sigma^2} \|x - \mu_\theta(z)\|^2 \right] + \text{const} \]

The MSE connection:

The squared Euclidean distance \(\|x - \mu_\theta(z)\|^2\) is exactly the mean squared error (MSE): \[ \|x - \mu_\theta(z)\|^2 = \sum_{j=1}^D (x_j - \mu_\theta(z)_j)^2 = D \cdot \text{MSE} \]

Summary
  • Maximizing \(\log p_\theta(x|z)\) ⟺ Minimizing \(\|x - \mu_\theta(z)\|^2\)
  • Therefore, the Gaussian likelihood assumption (with fixed variance) makes the reconstruction loss equivalent to MSE
  • MSE penalizes any deviation from the mean prediction, encouraging the decoder to output averaged or blurred reconstructions when multiple plausible outputs exist.
  • When the data has multiple valid reconstructions (e.g., an image could have either a cat or a dog), MSE forces the decoder to hedge its bets by outputting something in between, resulting in a blurry average rather than committing to one sharp possibility.
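The equivalence above is easy to verify numerically (the data point, decoder mean, and variance below are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)
D, sigma = 5, 0.3
x = rng.random(D)    # a hypothetical data point
mu = rng.random(D)   # a hypothetical decoder mean mu_theta(z)

# Gaussian log-likelihood with fixed variance...
log_lik = (-D / 2 * np.log(2 * np.pi * sigma**2)
           - np.sum((x - mu) ** 2) / (2 * sigma**2))

# ...equals a constant minus a scaled MSE
mse = np.mean((x - mu) ** 2)
const = -D / 2 * np.log(2 * np.pi * sigma**2)
print(np.isclose(log_lik, const - D * mse / (2 * sigma**2)))  # True
```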
Possible Solutions (for further reading)

Several approaches have been proposed to address the blurriness problem:

  1. Use different likelihood models:
    • For binary images (e.g., MNIST): Use Bernoulli likelihood instead of Gaussian \[p_\theta(x|z) = \prod_{j=1}^D \text{Bernoulli}(x_j; \mu_\theta(z)_j)\] This leads to binary cross-entropy loss instead of MSE
    • For natural images: Use discretized logistic mixture models or PixelCNN-style decoders.
  2. Learn the variance instead of fixing it:
    • Let the decoder output both \(\mu_\theta(z)\) and \(\sigma^2_\theta(z)\)
    • The model can adapt variance to different regions of the latent space
    • Can lead to better reconstructions but requires careful training
  3. Alternative divergences:
    • Use Wasserstein distance instead of KL divergence Wasserstein Auto-Encoders
    • Can lead to better sample quality

2. Posterior Collapse

Posterior collapse occurs when the encoder becomes uninformative and the decoder ignores the latent variable \(z\). In this case, the model behaves like a decoder-only model driven by the prior.

Why does this happen?

Posterior collapse typically occurs when:

  • Decoder is too powerful: When using autoregressive decoders (e.g., LSTMs for text, PixelCNN for images), the decoder can model the data distribution well without needing information from \(z\). When the decoder’s modeling capacity far exceeds what’s needed, it may find it easier to ignore \(z\) entirely.

  • Optimization dynamics: During early training, if the decoder quickly learns to generate reasonable outputs without \(z\), the gradient signal to the encoder becomes weak.

Mathematical perspective:

The VAE loss encourages minimizing: \[\mathcal{L}_{\text{VAE}} = \text{KL}\biggl(q_\phi(z|x) \; || \; p_\theta(z)\biggr) - \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log p_\theta(x|z) \right]\]

If the decoder can achieve low reconstruction error without using \(z\), then \(\log p_\theta(x|z) \approx \log p_\theta(x)\) (independent of \(z\)). In this case, the encoder has no incentive to encode meaningful information, and setting \(q_\phi(z|x) = p_\theta(z)\) minimizes the KL term without hurting reconstruction.

Here are some signs of posterior collapse:

  1. KL divergence drops to near zero: \[\text{KL}\biggl(q_\phi(z|x) \; || \; p_\theta(z)\biggr) \approx 0\] This means the encoder’s output is essentially identical to the prior.

  2. Encoder becomes input-independent: \[q_\phi(z|x) \approx p_\theta(z) \text{ for all } x\] The encoder outputs the same distribution regardless of the input.

  3. Decoder ignores the latent code: The decoder learns to generate outputs without using information from \(z\), relying solely on its own parameters
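A simple way to watch for signs 1 and 2 during training is to log the per-dimension KL on a validation batch; here is a sketch with hypothetical encoder outputs in which two of four latent dimensions have collapsed:

```python
import torch

def kl_per_dim(z_mu, z_logvar):
    """Batch-averaged KL( q(z_j|x) || N(0,1) ) for each latent dimension j.
    Dimensions with values near zero carry no information about x."""
    kl = -0.5 * (1 + z_logvar - z_mu.pow(2) - z_logvar.exp())
    return kl.mean(dim=0)  # shape: (d_z,)

# Hypothetical encoder outputs: dims 0-1 informative, dims 2-3 collapsed
z_mu = torch.tensor([[1.0, -0.8, 0.0, 0.0],
                     [-1.2, 0.9, 0.0, 0.0]])
z_logvar = torch.zeros(2, 4)
print(kl_per_dim(z_mu, z_logvar))  # near zero in the collapsed dimensions
```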

Summary
  • Posterior collapse occurs when the decoder can achieve low reconstruction error without using \(z\), leading the encoder to set \(q_\phi(z|x) = p_\theta(z)\).
  • This happens when the decoder is too powerful or when early optimization dynamics leave the encoder with a weak gradient signal.
Possible Solutions (for further reading)

Several techniques have been developed to prevent or mitigate posterior collapse:

  1. KL Annealing:

    • Start training with \(\beta = 0\) in \(\beta\)-VAE and gradually increase to 1
    • Gives the decoder time to learn to use \(z\) before the KL penalty becomes strong

    Modified loss function: \[\mathcal{L}_{\text{KL Annealing}} = \beta_t \cdot \text{KL}\biggl(q_\phi(z|x) \; || \; p_\theta(z)\biggr) - \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log p_\theta(x|z) \right]\]

    where \(\beta_t\) is a time-dependent weight that gradually increases during training.

    A common example is the linear schedule: \[\beta_t = \min(1, t/T)\] where \(t\) is the training step and \(T\) is the annealing period.

    • References:
      • Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing
      • Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
      • Understanding disentangling in β-VAE
  2. Free Bits:

    • Ensure a minimum amount of information is encoded in each latent dimension
    • Prevents any dimension from collapsing to the prior

    Modified loss function: \[\mathcal{L}_{\text{Free Bits}} = \sum_{j=1}^{d_z} \max\biggl(\lambda, \text{KL}_j\biggl(q_\phi(z_j|x) \; || \; p_\theta(z_j)\biggr)\biggr) - \mathbb{E}_{z \sim q_\phi(z|x)} \left[ \log p_\theta(x|z) \right]\]

    where:

    • \(d_z\) is the latent dimension
    • \(\text{KL}_j\) is the KL divergence for the \(j\)-th latent dimension
    • \(\lambda\) is the minimum bits threshold (e.g., \(\lambda = 0.5\) nats)

    How it works:

    If a dimension’s KL drops below \(\lambda\), that dimension contributes \(\lambda\) to the loss (constant). Once KL exceeds \(\lambda\), the actual KL is used. This creates a “free” zone where dimensions below threshold aren’t penalized further.

    Key difference from KL Annealing:

    • KL Annealing: Applies a global time-varying weight \(\beta_t\) to the entire KL term: \(\beta_t \cdot \text{KL}(q_\phi(z|x) \; || \; p_\theta(z))\)
    • Free Bits: Applies dimension-wise thresholding throughout training (no temporal schedule)
    • KL Annealing is a curriculum strategy; Free Bits is a regularization constraint
    • References:
      • First introduced in Improved Variational Inference with Inverse Autoregressive Flow
  3. Architectural changes:

    • Use weaker decoders (fewer layers, smaller hidden dimensions)
    • Use stronger encoders (more capacity to learn meaningful representations)
  4. Aggressive training of encoder:

    • Update encoder more frequently than decoder:

      In PyTorch, this can be achieved by performing multiple encoder updates for each decoder update during the training loop. For example, if you want to update the encoder 5 times for every 1 decoder update:

      encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
      decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
      
      for batch in dataloader:
          x = batch["input"].to(device)
          # --- Aggressive encoder training: ---
          for _ in range(5):  # Update encoder 5x
              encoder_optimizer.zero_grad()
              z_mu, z_logvar = encoder(x)
              # Implement the reparameterization trick here
              z = reparameterize(z_mu, z_logvar)
              recon = decoder(z)
              loss = vae_loss(recon, x, z_mu, z_logvar)
              loss.backward()
              encoder_optimizer.step()
      
          # ----- Decoder update -----
          decoder_optimizer.zero_grad()
    z_mu, z_logvar = encoder(x)
    z_mu, z_logvar = z_mu.detach(), z_logvar.detach()  # Optionally, stop gradients to the encoder
          z = reparameterize(z_mu, z_logvar)
          recon = decoder(z)
          loss = vae_loss(recon, x, z_mu, z_logvar)
          loss.backward()
          decoder_optimizer.step()

      This technique can help the encoder keep up with a fast-learning or powerful decoder, mitigating posterior collapse.

    • Pre-train encoder before joint training
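The `reparameterize` and `vae_loss` helpers referenced in the training loop above could be sketched as follows (assuming a Gaussian decoder trained with an MSE reconstruction term; the exact reconstruction loss depends on the chosen likelihood):

```python
import torch
import torch.nn.functional as F

def reparameterize(z_mu, z_logvar):
    """z = mu + sigma * eps with eps ~ N(0, I); keeps gradients w.r.t. mu and logvar."""
    std = torch.exp(0.5 * z_logvar)
    return z_mu + std * torch.randn_like(std)

def vae_loss(recon, x, z_mu, z_logvar):
    """Negative ELBO: reconstruction term plus analytic KL to the N(0, I) prior."""
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = 0.5 * torch.sum(z_mu.pow(2) + z_logvar.exp() - 1.0 - z_logvar)
    return recon_loss + kl
```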

  5. Auxiliary losses:

    • Add discriminative tasks that require using \(z\) (e.g., predict attributes from \(z\))
    • Add mutual information maximization terms (e.g., maximize the mutual information between the latent variable \(z\) and the observed data \(x\), \(I(z; x)\), to encourage \(z\) to carry more information about \(x\) and prevent it from being ignored by the decoder)

3. Limited Expressiveness

In principle, VAEs are flexible models that can represent complex distributions, and one can start simple and gradually add capacity. In practice, most implementations begin with a simple diagonal Gaussian for the approximate posterior, \(q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))\), and a fixed isotropic Gaussian prior, \(p_\theta(z) = \mathcal{N}(0, I)\). These are reasonable starting assumptions, but they can be too restrictive for some applications.

Why does this happen?

Limited expressiveness can be caused by the following problems:

  1. Posterior is too simple (Amortization Gap)

Most VAE implementations use a simple diagonal Gaussian for the approximate posterior: \[q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))\]

However, the true posterior \(p_\theta(z|x)\) might be:

  • Multimodal (multiple peaks) - e.g., an ambiguous image could encode to multiple plausible interpretations
  • Skewed or have heavy tails
  • Have complex correlations between dimensions

This limits how well the VAE can fit the true data distribution. One contributing factor has a name of its own: the amortization gap.

The amortization gap is the difference between optimizing a separate variational posterior \(q^*(z|x)\) for each data point and using a single shared (amortized) encoder network \(q_\phi(z|x)\) for all data points.

\[ \text{Amortization Gap} = \underbrace{L(q^*(z|x))}_{\text{per-data-point VI}} - \underbrace{L(q_\phi(z|x))}_{\text{amortized VI (shared encoder)}} \]

  2. Prior is too simple and fixed

The standard VAE uses a fixed isotropic Gaussian prior: \[p_\theta(z) = \mathcal{N}(0, I)\]

This can lead to the following limitations:

  • Doesn’t adapt to the data
  • Assumes all dimensions are independent and equally important
  3. Prior-Posterior Mismatch

The KL term \(\text{KL}(q_\phi(z|x) \; || \; p_\theta(z))\) tries to force the posterior close to the prior. But if:

  • The prior is too simple to match the natural structure of the aggregated (over all data points) posterior \(p_{\text{agg}}(z) = \int q_\phi(z|x) p_{\text{data}}(x) dx\)
  • The posterior family is too simple to match the true (per-data-point) posterior \(p_\theta(z|x)\)

Then we get a mismatch. That is, the model must choose between:

  1. Good reconstructions (complex posterior, high KL penalty)
  2. Low KL divergence (simple posterior that matches prior, poor reconstructions)
Summary
  • The amortization gap is the difference between fitting a flexible, datapoint-specific posterior and using a single shared encoder network; it limits how well the VAE can approximate the true posterior.
  • A mismatch between the prior and posterior families (often because the prior is too simple or the approximate posterior is not expressive enough) creates a trade-off between accurate reconstructions and forcing latent codes to match the prior, leading to less optimal models.
Possible Solutions (for further reading)

These solutions can improve either the posterior, prior, or both:

1. Normalizing Flows (for posterior): Transform a simple base distribution through invertible transformations to create a more expressive approximate posterior: \[z_0 \sim q_0(z_0), \quad z_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(z_0)\]

The final distribution \(q(z_K)\) can capture complex, multimodal structures.
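A minimal sketch of one invertible step (an elementwise affine flow; real implementations use planar, radial, or autoregressive transforms): each layer returns the transformed sample together with the log-determinant needed for the change-of-variables density \(\log q(z_K) = \log q_0(z_0) - \sum_k \log\left|\det \frac{\partial f_k}{\partial z}\right|\).

```python
import torch

class AffineFlow(torch.nn.Module):
    """One invertible elementwise step z -> z * exp(s) + t (a toy flow layer)."""
    def __init__(self, dim):
        super().__init__()
        self.s = torch.nn.Parameter(torch.zeros(dim))
        self.t = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        z_new = z * torch.exp(self.s) + self.t
        log_det = self.s.sum()  # log|det(dz_new/dz)| = sum of the log scales
        return z_new, log_det
```

Stacking several such layers (with nonlinear transforms in between) is what gives the final posterior its expressiveness.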

  • References:

    • Variational Inference with Normalizing Flows

    • Improved Variational Inference with Inverse Autoregressive Flow

2. Mixture of Gaussians Posteriors: Use a mixture distribution for the approximate posterior: \[q_\phi(z|x) = \sum_{k=1}^K \pi_k(x) \mathcal{N}(z; \mu_k(x), \Sigma_k(x))\]

This allows multimodal posteriors where a single input can map to multiple plausible latent codes.

3. VampPrior (Variational Mixture of Posteriors Prior): Instead of a fixed prior, use a learnable mixture of posteriors: \[p(z) = \frac{1}{K} \sum_{k=1}^K q_\phi(z|u_k)\]

where \(\{u_k\}_{k=1}^K\) are learnable pseudo-inputs. This allows the prior to adapt to the data and better match the aggregated posterior.
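A sketch of evaluating the VampPrior density (the `encoder` interface returning `(mu, logvar)` per pseudo-input is an assumption about the model's API):

```python
import math
import torch

def vamp_prior_log_prob(z, encoder, pseudo_inputs):
    """log p(z) = log( (1/K) * sum_k q_phi(z | u_k) ) for diagonal-Gaussian posteriors."""
    mu, logvar = encoder(pseudo_inputs)                  # each of shape (K, d)
    std = torch.exp(0.5 * logvar)
    comp = torch.distributions.Normal(mu, std)
    log_probs = comp.log_prob(z.unsqueeze(0)).sum(-1)    # (K,) per-component log densities
    K = pseudo_inputs.shape[0]
    return torch.logsumexp(log_probs, dim=0) - math.log(K)
```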

  • Reference:

    • VAE with a VampPrior

4. Normalizing Flows (for prior): Learn a flexible prior using normalizing flows: \[p_\theta(z) = p_0(f_\theta^{-1}(z)) \left|\det \frac{\partial f_\theta^{-1}(z)}{\partial z}\right|\]

This allows the prior to have complex structure while remaining tractable.

5. Hierarchical/Ladder VAEs: Use multiple levels of latent variables: \[p_\theta(z_1, z_2, \ldots, z_L) = p(z_L) \prod_{l=1}^{L-1} p_\theta(z_l | z_{l+1})\]

Each level can capture different aspects of the data at different scales.

  • Reference:

    • Ladder Variational Autoencoders

6. Two-Stage VAE:

First train a standard VAE, then fit a more flexible prior to the aggregated posterior using techniques like:

  • RealNVP on the aggregated posterior samples
  • Train a second VAE on \(z\) samples

This decouples prior learning from posterior learning.

4. Inability to Directly Measure True Likelihood

Since the ELBO is only a lower bound, we cannot directly compute the true log-likelihood \(\log p_\theta(x)\).

\[\log p_\theta(x) = \text{ELBO} + \underbrace{\text{KL}(q_\phi(z|x) \; || \; p_\theta(z|x))}_{\text{unknown gap}}\]

This limitation makes it difficult to compare different VAE models or to assess how well a model fits the true data distribution. Moreover, ELBO improvements don’t necessarily mean better generative quality.

Why does this happen?

To compute the true log-likelihood \(\log p_\theta(x)\), we would need to marginalize over all possible latent values:

\[\log p_\theta(x) = \log \int p_\theta(x, z) \, dz = \log \int p_\theta(x|z) p_\theta(z) \, dz\]

This integral is generally intractable: with a neural-network decoder there is no closed-form solution, and numerically integrating over a high-dimensional latent space is computationally infeasible.

ELBO helps but doesn’t fully solve this problem.

We derived that: \[\log p_\theta(x) = \text{ELBO}(\theta, \phi; x) + \text{KL}(q_\phi(z|x) \; || \; p_\theta(z|x))\]

  • We can compute ELBO easily
  • But the gap (the KL term) is unknown because it involves the intractable posterior \(p_\theta(z|x)\)
  • Therefore, we only have a lower bound on the true likelihood, not the exact value

The practical implication is that we can’t directly evaluate “how likely is this data under my model?” or accurately compare likelihoods between different VAE models.

Summary
  • The true log-likelihood \(\log p_\theta(x)\) is intractable to compute.
  • ELBO helps but doesn’t fully solve this problem.
  • We only have a lower bound on the true likelihood, not the exact value
  • The practical implication is that we can’t directly evaluate “how likely is this data under my model?” or accurately compare likelihoods between different VAE models.
Possible Solutions (for further reading)

There are several possible solutions to this problem:

1. Importance Sampling (Importance Weighted Autoencoder - IWAE):

Estimate the true log-likelihood using multiple samples:

We can approximate the log-likelihood using importance sampling because we have access to all components of the equation via our VAE:

\[ \log p_\theta(x) \approx \log \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)}, \quad z_k \sim q_\phi(z|x) \]

where the joint distribution in the numerator can be expanded as: \[ p_\theta(x, z_k) = p_\theta(x|z_k)\; p_\theta(z_k) \]

  • \(q_\phi(z|x)\): The encoder of the VAE parameterized by \(\phi\) generates samples \(z_k\) given \(x\).
  • \(p_\theta(x|z_k)\): The decoder of the VAE parameterized by \(\theta\) computes the likelihood of \(x\) given the sampled latent \(z_k\).
  • \(p_\theta(z_k)\): The VAE prior (typically standard Gaussian) — we can evaluate this for our sampled \(z_k\).

Thus, all terms are explicitly defined in a trained VAE, allowing us to numerically estimate the log-likelihood via this weighted average.
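A self-contained toy check of this estimator (the helper names are mine; in a real VAE, `log_joint` would combine the decoder likelihood and the prior): for the linear-Gaussian model \(z \sim \mathcal{N}(0,1)\), \(x|z \sim \mathcal{N}(z,1)\), the marginal \(p(x) = \mathcal{N}(0,2)\) is known in closed form, and using the exact posterior \(q(z|x) = \mathcal{N}(x/2, 1/2)\) makes every importance weight equal \(p(x)\), so the estimate is exact for any \(K\).

```python
import math
import torch

def iwae_estimate(x, q_mu, q_std, log_joint, K=1000):
    """log p(x) ~= logsumexp_k [ log p(x, z_k) - log q(z_k|x) ] - log K, with z_k ~ q(z|x)."""
    z = q_mu + q_std * torch.randn(K)
    log_q = torch.distributions.Normal(q_mu, q_std).log_prob(z)
    log_w = log_joint(x, z) - log_q
    return torch.logsumexp(log_w, dim=0) - math.log(K)

# Toy model: z ~ N(0, 1), x|z ~ N(z, 1)  =>  p(x) = N(0, 2)
def log_joint(x, z):
    return (torch.distributions.Normal(0.0, 1.0).log_prob(z)
            + torch.distributions.Normal(z, 1.0).log_prob(x))

x = torch.tensor(0.7)
est = iwae_estimate(x, x / 2, torch.tensor(math.sqrt(0.5)), log_joint, K=10)
exact = torch.distributions.Normal(0.0, math.sqrt(2.0)).log_prob(x)
```

With an approximate (rather than exact) posterior, the estimate becomes a lower bound that tightens as \(K\) grows.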

As \(K \to \infty\), this converges to the true log-likelihood. In practice:

  • \(K = 5000\) samples gives reasonable estimates

  • Computationally expensive (requires many forward passes through the network for each sample \(z_k\))

  • Provides a tighter lower bound on the true log-likelihood than standard ELBO

  • Reference:

    • Importance Weighted Autoencoders

2. Human Evaluation:

Ask humans to rate samples on multiple dimensions:

  • Quality: “Does this look realistic?”
  • Diversity: “Are the samples varied enough?”
  • Semantic coherence: “Does this match the description/input?”

Common approaches:

  • A/B testing: Compare samples from different models side-by-side
  • Likert scale ratings (e.g., 1-5 quality scores)
  • Turing test-style: Can humans distinguish real from generated?

3. Reconstruction Metrics:

For evaluating reconstruction quality specifically:

  • MSE: Pixel-level reconstruction accuracy
  • SSIM (Structural Similarity Index): Perceptual reconstruction quality
  • L1/L2 distance in latent space: For measuring latent space quality

Despite these limitations, VAEs remain a powerful and widely-used framework. The zoo of published variants is large, but most practical ideas still cluster into a handful of families, according to which ingredient of the VAE they change (objective, latent distributions, decoder/likelihood, or local geometry). The list below is a road map, not a catalogue: every bullet is representative, not exhaustive. Several of these techniques have already appeared in the collapsible Possible Solutions boxes within the limitations section above; they are collected here to show how they relate to one another as a whole.

  • Tighter Monte Carlo objectives. Multi-sample importance-weighted bounds (IWAE and follow-ons) improve estimation of the marginal likelihood, that is \(p_\theta(x) = \int p_\theta(x|z) p_\theta(z) \, dz\), and gradient quality, while keeping the same generative story \(p_\theta(x,z)=p_\theta(x|z)p_\theta(z)\).

  • Reweighted KL and independence-seeking objectives. \(\beta\)-VAE, FactorVAE, \(\beta\)-TCVAE, and related methods reshape the ELBO to stress independence, total correlation, or other structure in \(q_\phi(z|x)\), often motivated by disentanglement.

  • Different regularizers on latent or data distributions. Wasserstein / MMD-style autoencoders (e.g., WAE) swap the KL for other distances or penalties; adversarial regularization of the encoder or latent space also fits here.

  • Richer priors and approximate posteriors. Normalizing flows in \(q_\phi(z|x)\) or \(p_\theta(z)\), learned mixture priors (VampPrior), and hierarchical or ladder architectures (e.g., NVAE, VDVAE, and LVAE) increase the flexibility of latent inference and the top-level prior.

  • Different latent variable types. Vector quantization (VQ-VAE and successors), categorical or mixed continuous–discrete bottlenecks, and codebook-based latents change the support of \(z\) rather than only tuning a Gaussian ELBO.

  • More expressive decoders and likelihoods. Autoregressive pixels, decoders built from normalizing flows, and, especially in recent image systems, hybrids that pair a latent bottleneck with diffusion or score-based decoders (e.g. Multimodal ELBO with Diffusion Decoders) aim to fix the classic blur / limited likelihood of shallow Gaussian decoders.

  • Explicit geometry of the decoder map. Penalties on the Jacobian such as Jacobian \(L_1\) regularization (and related spectral or orthogonality constraints on sensitivities \(\partial x / \partial z\)) regularize how latent directions move the reconstruction, complementing the purely probabilistic knobs in the items above. They do not replace the ELBO; they sit beside it.

Jacobian-Regularized VAEs (“Jacobian VAE”)

A “Jacobian VAE” refers to a standard VAE whose loss function is enhanced with an additional penalty on the Jacobian of the decoder, that is, on how changes in the latent space affect the reconstructed output. This Jacobian-based term acts as an explicit geometric regularizer, added on top of the usual ELBO objective. There is no separate probabilistic derivation for this regularizer; it supplements the standard approach. A notable example is the Jacobian \(L_1\) regularization introduced by Rhodes and Lee (arXiv:2106.02923, NeurIPS 2021), which encourages local disentanglement in the learned latent representations.

What problem is it trying to solve?

When training VAEs for unsupervised representation learning, a major challenge is identifiability: different latent parameterizations can produce essentially the same reconstructions. In particular, the latent space often has rotational (and more generally, linear) ambiguities. If \(U\) is an orthogonal matrix with \(U^\top U = I\), we can replace a latent vector \(\vec{z}\) by \(U\vec{z}\) and adjust the decoder accordingly, often with little or no loss in reconstruction quality. Because a sufficiently flexible decoder can absorb such transformations, there is generally no unique latent basis preferred by reconstruction alone. The result is a non-identifiable latent space, where multiple equally valid coordinate systems explain the same data.

One important attempt to address this is the \(\beta\)-VAE (discussed above under Posterior Collapse), which modifies the standard ELBO by increasing the weight on the KL term:

\[ \mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta \, \mathrm{KL}(q_\phi(z|x)\;||\;p(z)), \qquad \beta > 1 \]

Compared with a standard VAE, this places stronger pressure on the latent space to stay close to the prior and to use dimensions more independently. In practice, this often encourages different latent coordinates to capture different factors of variation, so the learned axes become more interpretable. However, this only helps partially: if several latent directions influence the decoder with similar strength, then the model can still rotate those directions among themselves without substantially changing reconstruction quality. In other words, \(\beta\)-VAE encourages factorization, but it does not fully remove the remaining rotational ambiguity.

Mental model: imagine two latent knobs that together control an object’s horizontal position and brightness. A plain VAE may learn two mixed knobs, where turning either knob changes both position and brightness. A \(\beta\)-VAE often helps separate these effects somewhat, but the model may still treat any rotated combination of those two knobs as equally good. Jacobian-based regularization goes one step further: it prefers a basis where one knob changes mostly one kind of feature and the other knob changes another, because that produces a sparser and more axis-aligned local sensitivity pattern in the decoder. In this sense, the Jacobian penalty breaks the symmetry that reconstruction loss alone leaves unresolved and encourages local disentanglement.

How does it work?

To make the Jacobian idea concrete, it helps to separate the roles of the encoder and decoder and to use notation that distinguishes latent-space quantities from data-space quantities. The encoder takes an observed data point \(x\) and outputs the parameters of the approximate posterior, typically a latent mean and latent standard deviation:

\[ q_\phi(z|x) = \mathcal{N}(\color{blue}{\mu_z(x)} \color{black}{, \mathrm{diag}(\sigma_z^2(x))}) \]

This is the distribution from which we sample the latent variable \(z\). The decoder then works in the opposite direction: given a latent code \(z\), it defines a conditional distribution over possible reconstructions, written as \(p_\theta(x|z)\).

In many common VAEs for continuous data, this decoder likelihood is chosen to be Gaussian:

\[ p_\theta(x|z) = \mathcal{N}(x; \color{green}{\mu_x(z)} \color{black}{, \sigma_x^2 I}) \]

Here the subscripts are meant to be mnemonic:

  • \(\color{blue}{\mu_z(x)}\) and \(\color{black}{\sigma_z(x)}\) are the encoder outputs because they parameterize a distribution over the latent variable \(z\).
  • \(\color{green}{\mu_x(z)}\) is the decoder output because it parameterizes a distribution over the data variable \(x\).

In other words, the encoder mean, \(\color{blue}{\mu_z(x)}\), lives in latent space, while the decoder mean, \(\color{green}{\mu_x(z)}\), lives in data space. In the Gaussian case, the neural network in the decoder outputs \(\color{green}{\mu_x(z)}\), the mean of the reconstruction distribution. If a different likelihood is used, such as Bernoulli for binary images, the decoder may instead output probabilities or logits; the same idea still applies. For Jacobian regularization, we differentiate the decoder’s output map with respect to the latent variable. The generative Jacobian is therefore:

\[ J_\theta(z) = \frac{ \overbrace{\partial \color{green}{\mu_x(z)}}^{\text{data space}} }{ \underbrace{\partial z^\top}_{\text{latent space}} } \in \mathbb{R}^{D \times d} \]

where \(D\) is the dimension of the data space and \(d\) is the dimension of the latent space. Thus, the Jacobian is a local linear map from latent space to data space: it tells us how a small change in a latent coordinate \(z_j\) changes the reconstructed output coordinate \(x_i\) (where \(x_i\) is the \(i\)-th coordinate of the data vector \(x\)). In particular, entry \([J_\theta(z)]_{ij}\) measures the sensitivity of the \(i\)-th data-space coordinate to perturbations in the \(j\)-th latent-space coordinate.

Jacobian \(L_1\) regularization in the JL1-VAE (Rhodes & Lee, 2021) augments the \(\beta\)-VAE objective with an \(L_1\) penalty on the generative Jacobian, modulated by a separate hyperparameter \(\gamma\). Their maximization objective per datapoint \(x\) is:

\[ \mathcal{L}_{\text{JL1}}(x) = \mathbb{E}_{z \sim q_\phi(z|x)} \Bigl[ \log p_\theta(x \mid z) \;-\; \gamma\,\|J_\theta(z)\|_1 \Bigr] \;-\; \beta\,\mathrm{KL}\!\left(q_\phi(z|x)\;\|\;\mathcal{N}(0,I)\right) \]

where \(\beta \geq 1\) is the standard \(\beta\)-VAE KL weight and \(\gamma > 0\) controls the strength of the Jacobian penalty. Note that this is a maximization objective: the \(L_1\) term is subtracted inside the expectation so that maximizing \(\mathcal{L}_{\text{JL1}}\) penalizes large Jacobian entries.

The Jacobian \(L_1\) norm in the penalty term expands as:

\[ \|J_\theta(z)\|_1 = \sum_{i=1}^{D}\sum_{j=1}^{d}\left|[J_\theta(z)]_{ij}\right| = \sum_{i=1}^{D}\sum_{j=1}^{d}\left|\frac{\partial \mu_{x,i}(z)}{\partial z_j}\right| \]

The \(L_1\) penalty encourages sparsity because it adds up the absolute values of all local sensitivities \(\left|\frac{\partial \mu_{x,i}(z)}{\partial z_j}\right|\). To reduce this cost, the model is encouraged to make many of these sensitivities exactly zero or very small, so that each latent direction affects only a limited subset of output coordinates rather than weakly affecting everything.
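In PyTorch, the generative Jacobian and its \(L_1\) norm can be computed for one latent point with `torch.autograd.functional.jacobian` (a toy linear decoder stands in for \(\mu_x(z)\) here; for a linear map the Jacobian is just the weight matrix, which makes the result easy to check):

```python
import torch

decoder = torch.nn.Linear(2, 5)    # toy decoder: d = 2 latents -> D = 5 outputs
z = torch.randn(2)

J = torch.autograd.functional.jacobian(decoder, z)   # shape (D, d) = (5, 2)
jl1_penalty = J.abs().sum()                          # ||J(z)||_1 term in the JL1 loss
```

For large decoders, computing the full Jacobian every step is expensive; practical implementations typically estimate it with a few Jacobian-vector products instead.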

It is helpful to compare this with an \(L_2\) penalty.

For example, the corresponding Jacobian \(L_2\) penalty would be:

\[ \|J_\theta(z)\|_2^2 = \sum_{i=1}^{D}\sum_{j=1}^{d}\left([J_\theta(z)]_{ij}\right)^2 = \sum_{i=1}^{D}\sum_{j=1}^{d}\left(\frac{\partial \mu_{x,i}(z)}{\partial z_j}\right)^2 \]

Suppose we fix one latent coordinate \(z_j\). Its local effects on all output coordinates form the \(j\)-th column of the Jacobian, a vector in \(\mathbb{R}^D\). For example, if \(D=5\), that column could be \([1,1,0,0,0]\). Then this column’s \(L_1\) contribution is \(1+1=2\). If the same total effect is spread across many output coordinates, for example \([0.4,0.4,0.4,0.4,0.4]\), the \(L_1\) contribution is still \(5 \times 0.4 = 2\). In other words, under \(L_1\), spreading influence across many small entries is not cheaper than concentrating it into a few larger ones.

By contrast, under \(L_2\) the first pattern has cost \(1^2 + 1^2 = 2\), while the second has cost \(5 \times 0.4^2 = 0.8\). So \(L_2\) prefers to spread influence over many small nonzero entries, whereas \(L_1\) is more compatible with having many zeros and a few active effects. In the Jacobian setting, this means \(L_1\) encourages a decoder where each latent coordinate changes only a smaller, more selective part of the output. Because \(L_1\) is also not invariant under arbitrary rotations of the columns of \(J\), it can pick a preferred orientation among otherwise equivalent latent bases.
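Numerically, the two column patterns above compare as follows (a minimal check):

```python
import torch

concentrated = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])   # few large sensitivities
spread = torch.tensor([0.4, 0.4, 0.4, 0.4, 0.4])          # same total effect, spread out

l1_conc, l1_spread = concentrated.abs().sum(), spread.abs().sum()     # both 2 (up to rounding)
l2_conc, l2_spread = concentrated.pow(2).sum(), spread.pow(2).sum()   # 2.0 vs 0.8
```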

The phrase “not invariant under arbitrary rotations” means the following. Suppose, in a simple two-dimensional latent space, the local decoder Jacobian is:

\[ J = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]

Its \(L_1\) cost is \(|1|+|0|+|0|+|1| = 2\). Now rotate the latent coordinates by \(45^\circ\). Let \(u\) denote the latent coordinates in the rotated basis (for the same latent point), while \(z\) denotes coordinates in the original basis. Write this as

\[ z = Ru, \qquad R= \begin{bmatrix} \cos 45^\circ & -\sin 45^\circ\\ \sin 45^\circ & \cos 45^\circ \end{bmatrix} = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}}\\ \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}. \]

By the chain rule, the Jacobian in the rotated coordinates is

\[ \tilde J(u) = \frac{\partial \mu_x(Ru)}{\partial u} = \underbrace{\frac{\partial \mu_x}{\partial z}}_{J(z)} \underbrace{\frac{\partial z}{\partial u}}_{R} = J(z)\,R. \]

Here \(\frac{\partial z}{\partial u}=R\) simply because we defined \(z=Ru\): each component is \(z_i=\sum_j R_{ij}u_j\), so \(\frac{\partial z_i}{\partial u_j}=R_{ij}\).

At the point where \(J=I\), this gives

\[ \tilde{J} = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}. \]

This rotated Jacobian represents the same overall local geometry up to a change of latent coordinates, but its \(L_1\) cost is:

\[ \left\|\tilde{J}\right\|_1 = 4 \times \left|\tfrac{1}{\sqrt{2}}\right| = \tfrac{4}{\sqrt{2}} = 2\sqrt{2} \approx 2.83 > 2. \]

So although both bases can be equally good for reconstruction, the \(L_1\) penalty prefers the axis-aligned version because it is sparser. That is what it means to say that \(L_1\) can pick a preferred orientation: among many rotated latent bases that reconstruct equally well, it favors the one in which each latent direction has a more concentrated, less mixed effect on the output.
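The rotation example can be checked numerically:

```python
import math
import torch

J = torch.eye(2)                                  # axis-aligned local Jacobian
theta = math.radians(45)
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
J_rot = J @ R                                     # Jacobian in the rotated latent basis
```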

Pros and cons

Pros

  • Targets a concrete failure mode of “soft” disentanglement methods: remaining rotational symmetry when singular values of the generative map are similar.
  • Principled connection to classical ICA-style goals via sparsity of local sensitivities.
  • Reported gains on local disentanglement metrics and qualitative axis alignment in multi-object / multi-part image settings (see the paper’s experiments).

Cons

  • Extra compute and memory per step compared to a plain VAE; large decoders make full Jacobians expensive, so implementations need care.
  • Another hyperparameter \(\gamma\) trading reconstruction / ELBO quality against regularization; too strong a penalty can hurt likelihood and reconstruction.
  • Like other unsupervised disentanglement claims, evaluation can be dataset- and metric-dependent; “disentanglement” is not uniquely defined.

Related approaches

  • \(\beta\)-VAE and variants (FactorVAE, \(\beta\)-TCVAE, etc.): reweight or decompose the KL term to encourage independence of latent factors. Kumar and Poole show that \(\beta\)-VAE also induces implicit regularization related to Jacobian and Hessian structure of the decoder (arXiv:2002.00041), connecting VAE training to Jacobian-based autoencoder heuristics.
  • Orthogonal Jacobian regularization (OroJaR, ICCV 2021): encourages orthogonality between sensitivities of different latent dimensions (a different geometric constraint than elementwise \(L_1\) sparsity).
  • Sparse VAE / Identifiable Deep Generative Models via Sparse Decoding (Moran et al., 2021, arXiv:2110.10804): pursues the same underlying goal as Jacobian \(L_1\) regularization: making each observed feature depend on only a small subset of latent factors. But it reaches it through a different route. Instead of penalizing the Jacobian at training time, it places a sparsity prior directly on the decoder so that each output coordinate is structurally linked to only a few latents. This architectural sparsity allows the authors to prove formal identifiability guarantees: given enough data, the true latent factors can be uniquely recovered. This is a stronger statement than what Jacobian \(L_1\) regularization offers, where sparsity is encouraged as a soft penalty but not guaranteed. The Sparse VAE can therefore be thought of as a principled, provably identifiable alternative that targets the same non-identifiability problem from the model structure rather than from the loss function.
  • Spectral / Jacobian norms in autoencoders: broader literature on penalizing Jacobian norms for smoothness, robustness, or identifiability outside the VAE setup; the VAE case combines these ideas with the ELBO.
  • Supervised or semi-supervised constraints: when factor labels or group structure are available, auxiliary losses often achieve more reliable axis semantics than purely unsupervised Jacobian penalties.

Conclusion: Key Takeaways and Final Remarks

The Variational Autoencoder framework provides a principled solution to learning latent variable models when exact inference is intractable. By introducing a tractable variational posterior and optimizing the Evidence Lower Bound (ELBO), we transform an otherwise impossible likelihood maximization problem into a practical and scalable objective that can be optimized with stochastic gradient methods.

From the derivation, several core insights emerge:

  • First, the VAE objective naturally decomposes into two competing terms:
    • A reconstruction term that encourages faithful data generation, via the decoder.
    • A KL regularization term that shapes the latent space by keeping the approximate posterior close to the prior, via the encoder.
    • This trade-off is not an implementation detail, but a direct consequence of variational inference.
  • Second, maximizing the ELBO simultaneously improves data likelihood and posterior approximation, even though the true posterior is never computed explicitly.

This framework extends seamlessly to Conditional VAEs by conditioning both the encoder and decoder on auxiliary information. The mathematical structure of the ELBO remains unchanged, highlighting the flexibility and generality of variational inference as a modeling paradigm.