Mathematical Derivation of VAE
What Is the Aim of This Blog?
This blog draws upon insights from this article and this post, with the objective of developing a better understanding of Variational Autoencoders (VAEs) for my own research.
Introduction: What Are Variational Autoencoders (VAEs)?
Variational Autoencoders (VAEs) are a foundational class of latent variable models that combine probabilistic modeling with neural networks, enabling both representation learning and data generation. Unlike standard autoencoders, VAEs provide a principled probabilistic framework that allows us to reason about uncertainty, impose structure on latent spaces, and perform meaningful sampling.
At a high level, VAEs assume that high-dimensional observations are generated from a lower-dimensional latent variable through a stochastic process. Learning such models, however, presents a central challenge: the true posterior distribution over latent variables is intractable. Variational inference resolves this by introducing a tractable approximation and reframing learning as an optimization problem.
The goal of this post is to develop a precise and fully mathematical understanding of how VAEs are derived, from their generative assumptions to the Evidence Lower Bound (ELBO) that is optimized in practice. Rather than focusing on intuition alone, I walk through each step of the derivation, clarifying where approximations are introduced and why the final objective is both computable and effective.
Intuition: What a VAE Is Really Learning
A VAE assumes that each data point can be explained by a small set of hidden factors (latent variables) that capture the essential structure of the data.
The encoder does not map an input to a single point in latent space, but instead learns a distribution over plausible latent representations, reflecting uncertainty about how the data was generated.
The decoder then learns how to probabilistically reconstruct the data from samples drawn from this latent distribution.
Training a VAE is therefore a balancing act: we want latent representations that are expressive enough to reconstruct the data well, while also being regularized to follow a simple prior distribution so that the latent space remains smooth and generative.
Assumptions: What Are the Core Assumptions of a VAE Model?
Assume we have a dataset \(X\):
\[X = [\vec{x}^{(i)}]_{i=1}^N = \{\vec{x}^{(1)}, \vec{x}^{(2)}, \ldots, \vec{x}^{(N)}\}\]
The samples \(\vec{x}^{(i)}\) are i.i.d. and may be continuous or discrete-valued.
The VAE framework makes a fundamental assumption about how the observed data is generated: the high-dimensional dataset \(X\) is produced by a latent variable model, in which each observation arises from an underlying lower-dimensional random variable \(\vec{z}\) through a stochastic process.
The goal is to find the parameter \(\color{blue}{\theta^*}\) that makes our observed data as likely as possible under the model. In other words, we want to maximize the likelihood of the data:
\[\color{blue}{\theta^*} {\color{black}{= \arg \max_{\theta}\prod_{i=1}^N}} \color{teal}{p_\theta(\vec{x}^{(i)})}\]
However, it is usually more convenient and numerically stable to work with the log-likelihood (since logs turn products into sums). Therefore, we rewrite the objective as:
\[\color{blue}{\theta^*} \color{black}{ = \arg \max_{\theta}\sum_{i=1}^N \log} \; \color{black}{\left( \color{teal}{p_\theta(\vec{x}^{(i)})} \color{black} \right)}\]
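As a quick numerical illustration of why the log-likelihood is preferred, here is a small NumPy sketch (with made-up per-sample likelihood values): the raw product of many small probabilities underflows to zero in floating point, while the sum of logs stays finite.

```python
import numpy as np

# Toy likelihoods p_theta(x^(i)) for N = 500 i.i.d. points (hypothetical values).
rng = np.random.default_rng(0)
probs = rng.uniform(1e-4, 1e-2, size=500)

naive_product = np.prod(probs)          # underflows to 0.0 in float64
log_likelihood = np.sum(np.log(probs))  # stable: a finite negative number

print(naive_product)   # 0.0 due to floating-point underflow
print(log_likelihood)
```

Because the log is monotonic, the maximizer of the sum of logs is the same \(\theta^*\) as the maximizer of the product.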
To compute \(\color{teal}{p_\theta(\vec{x}^{(i)})}\), we marginalize over the latent variable \(z\):
\[\color{teal}{p_\theta(\vec{x}^{(i)})} \color{black}{=\int} \color{blue}{p_\theta(x|z)}\, \color{purple}{p_\theta(z)}\, \color{black}{dz}\]
However, directly computing this integral is typically intractable because it requires evaluating \(\color{blue}{p_\theta(x|z)}\) for all possible values of \(z\). To make this computation practical, we introduce an auxiliary function, \(\color{green}{q_\phi(z|x)}\), called the variational distribution or approximate posterior. This function, parameterized by \(\color{green}{\phi}\), provides a tractable way to estimate which values of \(z\) are likely given a particular input \(x\).
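To see concretely why a good \(\color{green}{q_\phi(z|x)}\) matters, here is a toy 1-D sketch (all numbers hypothetical) where the evidence is known in closed form. A naive Monte Carlo estimate that samples \(z\) from the prior is very noisy, because most prior samples explain \(x\) poorly; sampling from a distribution concentrated near the true posterior, which is what a trained encoder approximates, gives a far better estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D model (hypothetical): prior z ~ N(0,1), likelihood x | z ~ N(z, 0.1^2).
x, sigma = 3.0, 0.1

def normal_pdf(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Analytic evidence for this conjugate model: p(x) = N(x; 0, 1 + sigma^2).
exact = normal_pdf(x, 0.0, np.sqrt(1 + sigma**2))

# Naive Monte Carlo from the prior: most samples z make p(x|z) ~ 0, so the
# estimate has huge variance.
z_prior = rng.standard_normal(10_000)
naive = normal_pdf(x, z_prior, sigma).mean()

# Importance sampling with a q(z|x) concentrated near the true posterior
# (here the exact posterior mean/std, which an encoder would approximate).
post_var = 1 / (1 / 1.0 + 1 / sigma**2)
post_mean = post_var * x / sigma**2
z_q = post_mean + np.sqrt(post_var) * rng.standard_normal(10_000)
weights = (normal_pdf(x, z_q, sigma) * normal_pdf(z_q, 0, 1)
           / normal_pdf(z_q, post_mean, np.sqrt(post_var)))
guided = weights.mean()

print(exact, naive, guided)  # guided matches exact far more closely than naive
```

This is exactly the role \(\color{green}{q_\phi(z|x)}\) plays: it points the integration toward the values of \(z\) that actually matter for a given \(x\).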
Architecture: What Is the Role of VAE’s Encoder and Decoder?
To understand the VAE architecture, let’s start with Bayes’ theorem, which relates our key distributions:
\[\color{red}{p_\theta(z|x)} \color{black}{= \frac{\color{purple}{p_\theta(z)} \color{black}{ \times} \color{blue}{p_\theta(x|z)}}{\color{teal}{p_\theta(x)}}}\]
This equation shows that the posterior distribution \(\color{red}{p_\theta(z|x)}\) (the probability of latent code \(z\) given observation \(x\)) can be computed from the prior \(\color{purple}{p_\theta(z)}\), the likelihood \(\color{blue}{p_\theta(x|z)}\), and the evidence \(\color{teal}{p_\theta(x)}\).
The Challenge: Intractable Posterior
The posterior \(\color{red}{p_\theta(z|x)}\) is intractable to compute directly because it requires knowing \(\color{teal}{p_\theta(x)}\), which involves the difficult integral we discussed earlier. This is where the VAE’s two-part architecture comes in: an encoder network that parameterizes the approximate posterior \(\color{green}{q_\phi(z|x)}\), and a decoder network that parameterizes the likelihood \(\color{blue}{p_\theta(x|z)}\).
In summary: the encoder compresses observations \(x\) into distributions over latent representations \(z\), while the decoder reconstructs observations from latent codes.
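As a minimal sketch of this two-part architecture (the shapes and the plain linear maps are placeholders standing in for trained neural networks), the encoder outputs the parameters of \(\color{green}{q_\phi(z|x)}\) rather than a single point, and the decoder maps a sampled \(z\) back to data space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the random matrices stand in for trained networks.
x_dim, z_dim = 8, 2
W_enc = rng.standard_normal((2 * z_dim, x_dim)) * 0.1
W_dec = rng.standard_normal((x_dim, z_dim)) * 0.1

def encode(x):
    h = W_enc @ x
    mu, log_var = h[:z_dim], h[z_dim:]
    return mu, log_var          # parameters of q_phi(z|x), not a single point

def decode(z):
    return W_dec @ z            # mean of p_theta(x|z)

x = rng.standard_normal(x_dim)
mu, log_var = encode(x)
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(z_dim)  # z ~ q_phi(z|x)
x_recon = decode(z)
print(mu.shape, z.shape, x_recon.shape)
```

The key design point is that the encoder’s output is a distribution: two inputs can map to overlapping regions of latent space, which is what makes the latent space smooth and sampleable.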
Training Objective: How Can We Learn \(\color{green}{\phi}\) and \(\color{blue}{\theta}\) Jointly?
Now we have two sets of parameters to optimize:
- Encoder parameters \(\color{green}{\phi}\): control the approximate posterior \(\color{green}{q_\phi(z|x)}\)
- Decoder parameters \(\color{blue}{\theta}\): control the likelihood \(\color{blue}{p_\theta(x|z)}\) (and also appear in the true posterior \(\color{red}{p_\theta(z|x)}\))
Our goal is to make the estimated posterior \(\color{green}{q_\phi(z|x)}\) as close as possible to the true (but intractable) posterior \(\color{red}{p_\theta(z|x)}\).
Measuring Closeness: The KL Divergence
To measure how “close” two probability distributions are, we use the Kullback-Leibler (KL) divergence. Specifically, we use the reverse KL divergence:
\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log \frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right]}\]
which can be written more explicitly as:
For discrete latent variables: \[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \sum_{z \in Z} \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right)}\]
For continuous latent variables: \[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)}\color{black} \Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right) \color{black}{dz}}\]
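Both forms are easy to check numerically. The small discrete example below (hypothetical distributions over three latent values) also shows that the KL divergence is non-negative and not symmetric, which is why the direction \(q \,||\, p\) matters:

```python
import numpy as np

# Hypothetical distributions over 3 latent values.
q = np.array([0.6, 0.3, 0.1])
p = np.array([0.4, 0.4, 0.2])

kl_qp = np.sum(q * np.log(q / p))   # KL(q || p)
kl_pq = np.sum(p * np.log(p / q))   # KL(p || q): generally a different number

print(kl_qp, kl_pq)  # both non-negative, and not equal: KL is not symmetric
```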
By minimizing \(\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)}\color{black} \Biggr)\), we make our encoder’s approximate posterior as accurate as possible. However, there’s a problem: we can’t directly compute this KL divergence because it involves the intractable posterior \(\color{red}{p_\theta(z|x)}\)!
This is where the ELBO (Evidence Lower Bound) comes in, which we’ll derive next.
ELBO Derivation: Deriving the Evidence Lower Bound
Recall that we want to minimize \(\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr)\), but we can’t compute it directly because it involves the intractable posterior \(\color{red}{p_\theta(z|x)}\).
The variational inference idea is that we can derive an alternative objective that:
- Is tractable to compute
- Still allows us to optimize both \(\color{green}{\phi}\) and \(\color{blue}{\theta}\)
- Automatically handles the intractability
Let’s derive this alternative objective step by step.
Step 1: Start with the KL Divergence
For continuous latent variables, the KL divergence is defined as:
\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)}}{\color{red}{p_\theta(z|x)}} \right) \color{black}{dz}}\]
Step 2: Apply Bayes’ Rule to the Posterior
Using Bayes’ rule, we know that:
\[\color{red}{p_\theta(z|x)} \color{black}{= \frac{p_\theta(x,z)}{\color{teal}{p_\theta(x)}}}\]
Substituting this into our KL divergence:
\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\log} \left( \color{black}\frac{\color{green}{q_\phi(z|x)} \cdot \color{teal}{p_\theta(x)}}{p_\theta(x,z)} \right) \color{black}{dz}}\]
Step 3: Split the Logarithm
Using the property \(\log(ab/c) = \log(a) + \log(b/c)\), where \(a = \color{teal}{p_\theta(x)}\), \(b = \color{green}{q_\phi(z|x)}\), and \(c = p_\theta(x,z)\):
\[\text{KL} \Biggl( \color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black}{\left[ \log(\color{teal}{p_\theta(x)} \color{black}) + \log \left( \frac{\color{green}{q_\phi(z|x)}}{p_\theta(x,z)} \right) \right]} \color{black}{dz}}\]
Step 4: Separate the Integral
Split into two integrals:
\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)}\color{black} \Biggr) \color{black}{= \int \color{green}{q_\phi(z|x)} \color{black} \log(\color{teal}{p_\theta(x)} \color{black} ) \, dz + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{p_\theta(x,z)} \right) dz}\]
Since \(\log(\color{teal}{p_\theta(x)}\color{black})\) does not depend on \(z\), it factors out of the first integral, and \(\int \color{green}{q_\phi(z|x)} \color{black}{\, dz = 1}\) because \(\color{green}{q_\phi(z|x)}\) is a probability distribution. So now we have:
\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{p_\theta(x,z)} \right) \color{black}{dz}}\]
Step 5: Decompose the Joint Distribution
The joint distribution can be factored as:
\[p_\theta(x,z) = \color{blue}{p_\theta(x|z)} \cdot \color{purple}{p_\theta(z)}\]
Substituting this:
\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{blue}{p_\theta(x|z)} \cdot \color{purple}{p_\theta(z)}} \right) \color{black}{dz}}\]
Step 6: Simplify Using Logarithm Properties
Using \(\log(a/(bc)) = \log(a/b) - \log(c)\), where \(a = \color{green}{q_\phi(z|x)}\), \(b = \color{purple}{p_\theta(z)}\), and \(c = \color{blue}{p_\theta(x|z)}\):
\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \left[ \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{purple}{p_\theta(z)}} \right) - \log(\color{blue}{p_\theta(x|z)} \color{black}) \right]{dz}}\]
Step 7: Split Into Two Integrals
\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) \color{black}{= \log(\color{teal}{p_\theta(x)} \color{black}) + \int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{purple}{p_\theta(z)}} \right) \color{black}{dz} - \int \color{green}{q_\phi(z|x)} \color{black} \log(\color{blue}{p_\theta(x|z)} \color{black}) \color{black}{dz}}\]
Step 8: Recognize Standard Forms
The second term is the KL divergence between \(\color{green}{q_\phi(z|x)}\) and the prior \(\color{purple}{p_\theta(z)}\):
\[\int \color{green}{q_\phi(z|x)} \color{black} \log \left( \frac{\color{green}{q_\phi(z|x)}}{\color{purple}{p_\theta(z)}} \right) \color{black}{dz} = \text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \Biggr)\]
The third term is an expectation under \(\color{green}{q_\phi(z|x)}\):
\[\int \color{green}{q_\phi(z|x)} \color{black} \log(\color{blue}{p_\theta(x|z)} \color{black}) \color{black} \, dz = \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)}\color{black}) \right]\]
Therefore:
\[\text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\Biggr) = \log(\color{teal}{p_\theta(x)} \color{black}) + \text{KL}\Biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \Biggr) - \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right]\]
Step 9: Rearrange to Isolate the Log-Evidence \(\log p_\theta(x)\)
Rearranging the equation:
\[\log(\color{teal}{p_\theta(x)} \color{black}) = \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\biggr) + \underbrace{\mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right] - \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \biggr)}_{\text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}; x)}\]
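This identity can be verified numerically on a tiny discrete model (hypothetical numbers) where every quantity, including the posterior that is intractable in general, is computable by brute force:

```python
import numpy as np

# Tiny discrete model: z takes 3 values, x is a single fixed observation.
p_z = np.array([0.5, 0.3, 0.2])          # prior p_theta(z)
p_x_given_z = np.array([0.9, 0.2, 0.1])  # likelihood p_theta(x|z) at observed x
q = np.array([0.7, 0.2, 0.1])            # any approximate posterior q_phi(z|x)

p_x = np.sum(p_x_given_z * p_z)          # evidence (tractable in this toy case)
p_z_given_x = p_x_given_z * p_z / p_x    # true posterior via Bayes' rule

kl_posterior = np.sum(q * np.log(q / p_z_given_x))
elbo = np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z))

# The Step 9 identity: log p(x) = KL(q || p(z|x)) + ELBO.
print(np.log(p_x), kl_posterior + elbo)  # the two numbers agree
```

Since the KL term is non-negative, the ELBO is indeed a lower bound on \(\log p_\theta(x)\), with equality exactly when \(q\) matches the true posterior.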
Step 10: The VAE Loss Function
In practice, we want to minimize a loss function. The VAE loss is simply the negative ELBO:
\[\mathcal{L}_{\text{VAE}}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black}) = -\text{ELBO}(\color{blue}{\theta}, \color{green}{\phi}\color{black}; x )\]
\[\boxed{\mathcal{L}_{\text{VAE}}(\color{blue}{\theta}, \color{green}{\phi}; x \color{black}) = \text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{purple}{p_\theta(z)}\color{black} \biggr) - \mathbb{E}_{z \sim \color{green}{q_\phi(z|x)}} \left[ \log(\color{blue}{p_\theta(x|z)} \color{black}) \right]}\]
We seek the optimal parameters:
\[\color{green}{\phi}^*, \color{blue}{\theta}^* \color{black}{= \arg\min_{\color{green}{\phi}, \color{blue}{\theta}} \mathcal{L}_{\text{VAE}}(\color{blue}{\theta}, \color{green}{\phi} \color{black}; x )}\]
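When \(\color{green}{q_\phi(z|x)}\) is a diagonal Gaussian and the prior is \(\mathcal{N}(0, I)\), the KL term in \(\mathcal{L}_{\text{VAE}}\) has the well-known closed form \(\frac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\right)\). The sketch below (with hypothetical encoder outputs) checks that closed form against a Monte Carlo estimate of the same divergence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs (mu, log sigma^2) for one input x.
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.3])

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ): the regularizer in L_VAE.
kl_closed = 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

# Monte Carlo check of the same quantity directly from the KL definition.
sigma = np.exp(0.5 * log_var)
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi) + log_var, axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # agree up to Monte Carlo error
```

Having the KL term in closed form is what makes this part of the loss cheap to compute and differentiate during training; only the reconstruction expectation needs sampling.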
Parameter Names: What Are the Parameters of the VAE?
- \(\color{green}{\phi}\) are the variational parameters (encoder parameters)
- \(\color{blue}{\theta}\) are the generative parameters (decoder parameters)
Why This Solves Our Problem: What Is the Purpose of the VAE Loss?
Maximizing the ELBO (equivalently, minimizing \(\mathcal{L}_{\text{VAE}}\)) simultaneously:
- Increases \(\log(\color{teal}{p_\theta(x)} \color{black})\) (generates more realistic samples)
- Decreases \(\text{KL}\biggl(\color{green}{q_\phi(z|x)} \; || \; \color{red}{p_\theta(z|x)} \color{black}\biggr)\) (better posterior approximation)
And we can compute everything in the ELBO without ever needing the intractable \(\color{red}{p_\theta(z|x)}\)!
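To make this concrete, the sketch below (hypothetical numbers) minimizes the tractable negative ELBO over candidate distributions \(q\) by brute-force grid search, never evaluating \(\color{red}{p_\theta(z|x)}\) during the optimization; the minimizer nonetheless recovers the true posterior, which is computed at the end only as a check:

```python
import numpy as np

# Discrete toy model (hypothetical numbers), z over 3 values, x fixed.
p_z = np.array([0.6, 0.25, 0.15])
p_x_given_z = np.array([0.8, 0.3, 0.05])

def neg_elbo(q):
    # KL(q || p(z)) minus the expected log-likelihood: fully tractable.
    return np.sum(q * np.log(q / p_z)) - np.sum(q * np.log(p_x_given_z))

# Brute-force search over candidate posteriors q = (a, b, 1 - a - b).
best, best_loss = None, np.inf
grid = np.linspace(0.01, 0.98, 98)
for a in grid:
    for b in grid:
        if a + b < 0.99:
            q = np.array([a, b, 1 - a - b])
            loss = neg_elbo(q)
            if loss < best_loss:
                best, best_loss = q, loss

# True posterior, computed here only to verify the result.
true_posterior = p_x_given_z * p_z / np.sum(p_x_given_z * p_z)
print(best, true_posterior)  # the ELBO minimizer matches p(z|x) on the grid
```

In a real VAE the grid search is replaced by gradient descent on \(\phi\) and \(\theta\), but the logic is identical: the objective touches only \(q_\phi(z|x)\), \(p_\theta(x|z)\), and \(p_\theta(z)\).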
Limitations: What Are the Key Limitations of VAEs?
While VAEs provide a principled and elegant framework for learning latent variable models, they have several limitations:
1. Blurry Reconstructions
One of the most noticeable issues with VAEs is their tendency to produce blurry reconstructions, particularly for images. When the decoder models the likelihood as a Gaussian distribution with fixed variance, the reconstruction term reduces to a squared-error loss. As a result, when there are multiple plausible sharp outputs for a given input, the loss-optimal prediction is their pointwise average; this produces a blurry result rather than one sharp possibility.
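The averaging effect can be demonstrated in a few lines (a hypothetical 1-D "image" whose edge sits in one of two places): under a fixed-variance Gaussian likelihood, the prediction that minimizes the expected squared error is the pointwise mean of the plausible targets, a soft ramp rather than a sharp edge.

```python
import numpy as np

# Two equally plausible sharp signals (hypothetical): an edge after index 2 or 4.
edge_a = np.array([1., 1., 1., 0., 0., 0., 0., 0.])
edge_b = np.array([1., 1., 1., 1., 1., 0., 0., 0.])

# Fixed-variance Gaussian likelihood <=> squared-error loss, averaged over
# both plausible targets.
def expected_sq_error(y):
    return 0.5 * np.mean((y - edge_a) ** 2) + 0.5 * np.mean((y - edge_b) ** 2)

# The minimizer is the pointwise mean: a soft ramp wherever the targets disagree.
blurry = 0.5 * (edge_a + edge_b)
print(blurry)
print(expected_sq_error(blurry), expected_sq_error(edge_a))
```

The blurry mean beats either sharp edge on this loss, which is precisely the incentive that washes out high-frequency detail in VAE reconstructions.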
2. Posterior Collapse
Posterior collapse occurs when the approximate posterior \(q_\phi(z|x)\) collapses to the prior: the encoder becomes uninformative and the decoder ignores the latent variable \(z\). In this case, the model behaves like a decoder-only model driven by the prior.
3. Limited Expressiveness
The standard VAE makes restrictive distributional choices. In the most common setup, the approximate posterior is a simple diagonal Gaussian, \(q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \text{diag}(\sigma^2_\phi(x)))\), and the prior is a fixed isotropic Gaussian, \(p_\theta(z) = \mathcal{N}(0, I)\). These are reasonable assumptions to start with, but they can be too restrictive for some applications: a diagonal posterior cannot capture correlations between latent dimensions, and a unimodal prior may poorly match the true latent structure of the data.
4. Inability to Directly Measure True Likelihood
Since the ELBO is only a lower bound, we cannot directly compute the true log-likelihood \(\log p_\theta(x)\).
\[\log p_\theta(x) = \text{ELBO} + \underbrace{\text{KL}(q_\phi(z|x) \; || \; p_\theta(z|x))}_{\text{unknown gap}}\]
This limitation makes it difficult to compare different VAE models or to assess how well the model fits the true data distribution. Moreover, ELBO improvements don’t necessarily mean better generative quality.
Despite these limitations, VAEs remain a powerful and widely used framework. The zoo of published variants is large, but most practical ideas cluster into a handful of families, according to which ingredient of the VAE they change (objective, latent distributions, decoder/likelihood, or local geometry). The list below is a road map, not a catalogue: every bullet is representative, not exhaustive.
Tighter Monte Carlo objectives. Multi-sample importance-weighted bounds (IWAE and follow-ons) improve estimation of the marginal likelihood, that is \(p_\theta(x) = \int p_\theta(x|z) p_\theta(z) \, dz\), and gradient quality, while keeping the same generative story \(p_\theta(x,z)=p_\theta(x|z)p_\theta(z)\).
Reweighted KL and independence-seeking objectives. \(\beta\)-VAE, FactorVAE, \(\beta\)-TCVAE, and related methods reshape the ELBO to stress independence, total correlation, or other structure in \(q_\phi(z|x)\), often motivated by disentanglement.
Different regularizers on latent or data distributions. Wasserstein / MMD-style autoencoders (e.g., WAE) swap the KL for other distances or penalties; adversarial regularization of the encoder or latent space also fits here.
Richer priors and approximate posteriors. Normalizing flows in \(q_\phi(z|x)\) or \(p_\theta(z)\), learned mixture priors (VampPrior), and hierarchical or ladder architectures (e.g., NVAE, VDVAE, and LVAE) increase the flexibility of latent inference and the top-level prior.
Different latent variable types. Vector quantization (VQ-VAE and successors), categorical or mixed continuous–discrete bottlenecks, and codebook-based latents change the support of \(z\) rather than only tuning a Gaussian ELBO.
More expressive decoders and likelihoods. Autoregressive pixels, decoders built from normalizing flows, and, especially in recent image systems, hybrids that pair a latent bottleneck with diffusion or score-based decoders (e.g. Multimodal ELBO with Diffusion Decoders) aim to fix the classic blur / limited likelihood of shallow Gaussian decoders.
Explicit geometry of the decoder map. Penalties on the Jacobian such as Jacobian \(L_1\) regularization (and related spectral or orthogonality constraints on sensitivities \(\partial x / \partial z\)) regularize how latent directions move the reconstruction, complementing the purely probabilistic knobs in the items above. They do not replace the ELBO; they sit beside it.
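The first family above is easy to illustrate numerically. In the sketch below (a hypothetical 1-D model with a deliberately mismatched proposal), averaging \(K = 8\) importance weights inside the log, as in IWAE, gives a bound that sits between the standard ELBO and the true log-evidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D model: z ~ N(0,1), x|z ~ N(z, 0.5^2), observed x = 1.5,
# with a deliberately mismatched proposal q = N(0.5, 1).
x, sig, mu_q, s_q = 1.5, 0.5, 0.5, 1.0

def log_joint(z):
    # log p(x|z) + log p(z)
    return (-0.5 * ((x - z) / sig) ** 2 - np.log(sig * np.sqrt(2 * np.pi))
            - 0.5 * z ** 2 - 0.5 * np.log(2 * np.pi))

z = mu_q + s_q * rng.standard_normal((100_000, 8))
log_q = -0.5 * ((z - mu_q) / s_q) ** 2 - np.log(s_q * np.sqrt(2 * np.pi))
log_w = log_joint(z) - log_q

elbo = log_w[:, 0].mean()                           # K = 1: the standard ELBO
iwae_8 = np.log(np.exp(log_w).mean(axis=1)).mean()  # K = 8 importance samples
log_px = -0.5 * np.log(2 * np.pi * (1 + sig**2)) - 0.5 * x**2 / (1 + sig**2)
print(elbo, iwae_8, log_px)  # elbo < iwae_8 <= log p(x): the bound tightens with K
```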
Conclusion: Key Takeaways and Final Remarks
The Variational Autoencoder framework provides a principled solution to learning latent variable models when exact inference is intractable. By introducing a tractable variational posterior and optimizing the Evidence Lower Bound (ELBO), we transform an otherwise impossible likelihood maximization problem into a practical and scalable objective that can be optimized with stochastic gradient methods.
From the derivation, several core insights emerge:
- First, the VAE objective naturally decomposes into two competing terms:
- A reconstruction term that encourages faithful data generation, via the decoder.
- A KL regularization term that shapes the latent space by keeping the approximate posterior close to the prior, via the encoder.
- This trade-off is not an implementation detail, but a direct consequence of variational inference.
- Second, maximizing the ELBO simultaneously improves data likelihood and posterior approximation, even though the true posterior is never computed explicitly.
This framework extends seamlessly to Conditional VAEs by conditioning both the encoder and decoder on auxiliary information. The mathematical structure of the ELBO remains unchanged, highlighting the flexibility and generality of variational inference as a modeling paradigm.
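As a minimal sketch of that extension (shapes and the linear maps are placeholders standing in for trained networks), conditioning simply means feeding the auxiliary variable \(c\) to both networks alongside their usual inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; random matrices stand in for trained networks.
x_dim, z_dim, c_dim = 8, 2, 3
W_enc = rng.standard_normal((2 * z_dim, x_dim + c_dim)) * 0.1
W_dec = rng.standard_normal((x_dim, z_dim + c_dim)) * 0.1

def encode(x, c):
    # Parameters of q_phi(z | x, c): the encoder also sees the condition c.
    h = W_enc @ np.concatenate([x, c])
    return h[:z_dim], h[z_dim:]

def decode(z, c):
    # Mean of p_theta(x | z, c): the decoder also sees the condition c.
    return W_dec @ np.concatenate([z, c])

x, c = rng.standard_normal(x_dim), np.eye(c_dim)[0]  # one-hot condition
mu, log_var = encode(x, c)
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(z_dim)
x_recon = decode(z, c)
print(mu.shape, x_recon.shape)
```

The ELBO is unchanged except that every distribution gains a conditioning variable: \(q_\phi(z|x,c)\), \(p_\theta(x|z,c)\), and (optionally) \(p_\theta(z|c)\).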