Conditional Variational Autoencoder: Mathematical Foundations
This document provides rigorous mathematical derivations for the Conditional Variational Autoencoder (CVAE) used in macro scenario generation.
1. Problem Formulation
Let $\mathbf{x} \in \mathbb{R}^{T \times D}$ denote a multivariate macro time series with $T$ time steps and $D$ variables, and let $c \in \{0, 1, 2, 3\}$ represent the scenario condition (baseline, adverse, severely adverse, stagflation).
Objective: Learn a generative model $p_\theta(\mathbf{x} | c)$ that can sample realistic macro scenarios conditioned on scenario type.
2. Variational Lower Bound (ELBO)
2.1 Marginal Likelihood Decomposition
Theorem 1 (ELBO for Conditional VAE)
*For any conditional distribution $q_\phi(\mathbf{z} | \mathbf{x}, c)$ with support containing that of the true posterior $p_\theta(\mathbf{z} | \mathbf{x}, c)$, the log marginal likelihood satisfies:*
\[\log p_\theta(\mathbf{x} | c) \geq \mathcal{L}(\theta, \phi; \mathbf{x}, c) \tag{1}\]
where the Evidence Lower Bound (ELBO) is:
\[\mathcal{L}(\theta, \phi; \mathbf{x}, c) = \mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x}, c)}\big[\log p_\theta(\mathbf{x} | \mathbf{z}, c)\big] - D_{KL}\big(q_\phi(\mathbf{z} | \mathbf{x}, c) \| p(\mathbf{z} | c)\big) \tag{2}\]
Proof
Starting from the marginal likelihood:
$$ \log p_\theta(\mathbf{x} | c) = \log \int p_\theta(\mathbf{x}, \mathbf{z} | c) \, d\mathbf{z} $$
Introduce the variational distribution $q_\phi(\mathbf{z} | \mathbf{x}, c)$:
$$ \log p_\theta(\mathbf{x} | c) = \log \int q_\phi(\mathbf{z} | \mathbf{x}, c) \frac{p_\theta(\mathbf{x}, \mathbf{z} | c)}{q_\phi(\mathbf{z} | \mathbf{x}, c)} \, d\mathbf{z} $$
By Jensen's inequality (noting $\log$ is concave):
$$ \log p_\theta(\mathbf{x} | c) \geq \int q_\phi(\mathbf{z} | \mathbf{x}, c) \log \frac{p_\theta(\mathbf{x}, \mathbf{z} | c)}{q_\phi(\mathbf{z} | \mathbf{x}, c)} \, d\mathbf{z} $$
Expanding the joint:
$$ = \int q_\phi(\mathbf{z} | \mathbf{x}, c) \log \frac{p_\theta(\mathbf{x} | \mathbf{z}, c) \, p(\mathbf{z} | c)}{q_\phi(\mathbf{z} | \mathbf{x}, c)} \, d\mathbf{z} $$
Separating terms:
$$ = \underbrace{\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} | \mathbf{z}, c)]}_{\text{Reconstruction}} - \underbrace{D_{KL}(q_\phi(\mathbf{z} | \mathbf{x}, c) \| p(\mathbf{z} | c))}_{\text{Regularization}} $$
The gap between $\log p_\theta(\mathbf{x}|c)$ and the ELBO is exactly $D_{KL}\big(q_\phi(\mathbf{z} | \mathbf{x}, c) \| p_\theta(\mathbf{z} | \mathbf{x}, c)\big)$, the divergence between the approximate and true posteriors, which is non-negative. $\square$
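To make the bound concrete, here is a small numerical check on a toy 1-D model where the marginal likelihood is known in closed form; the model, the variational parameters, and all names below are illustrative assumptions, not part of the CVAE implementation.

```python
# Toy check of Theorem 1: with p(z) = N(0, 1) and p(x|z) = N(z, 1), the exact
# marginal is p(x) = N(0, 2). Any variational q(z|x) should give ELBO <= log p(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                   # observed data point (arbitrary)
mu_q, sigma_q = 0.4, 0.9                  # an arbitrary (suboptimal) q(z|x)

z = rng.normal(mu_q, sigma_q, size=100_000)                  # z ~ q(z|x)
log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z, 0.0, 1.0)
log_q = norm.logpdf(z, mu_q, sigma_q)
elbo = np.mean(log_joint - log_q)                            # Monte Carlo ELBO estimate

log_px = norm.logpdf(x, 0.0, np.sqrt(2.0))                   # exact log p(x)
print(f"ELBO = {elbo:.4f} <= log p(x) = {log_px:.4f}")
```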
2.2 Reparameterization Trick
Lemma 1.1 (Differentiability via Reparameterization)
*Let $q_\phi(\mathbf{z} | \mathbf{x}, c) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}, c), \text{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}, c)))$. Then samples can be expressed as:*
\[\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}, c) + \boldsymbol{\sigma}_\phi(\mathbf{x}, c) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \tag{3}\]
This transformation allows gradients to flow through the sampling operation.
Proof
For a random variable $Z \sim \mathcal{N}(\mu, \sigma^2)$, we can write $Z = \mu + \sigma \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$.

**Verification**:
- $\mathbb{E}[Z] = \mathbb{E}[\mu + \sigma\epsilon] = \mu + \sigma \cdot 0 = \mu$
- $\text{Var}(Z) = \text{Var}(\mu + \sigma\epsilon) = \sigma^2 \text{Var}(\epsilon) = \sigma^2$

This extends element-wise to the multivariate case. The key insight is that $\epsilon$ is independent of $\phi$, so:
$$ \nabla_\phi \mathbb{E}_{q_\phi}[f(\mathbf{z})] = \mathbb{E}_{\epsilon}\left[\nabla_\phi f(\boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\epsilon})\right] $$
which can be approximated via Monte Carlo sampling. $\square$
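A minimal PyTorch-style sketch of this sampling step follows; the tensor names (`mu`, `log_var`) and the log-variance parameterization are illustrative assumptions.

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), as in Eq. (3).

    Because eps is drawn independently of (mu, log_var), gradients flow
    through mu and log_var back to the encoder parameters phi.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)
    return mu + eps * std
```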
2.3 KL Divergence Closed Form
Lemma 1.2 (KL Divergence for Gaussians)
For $q = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p = \mathcal{N}(\mathbf{0}, \mathbf{I})$:
\[D_{KL}(q \| p) = \frac{1}{2} \sum_{j=1}^{d} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right) \tag{4}\]
Proof
For two multivariate Gaussians with diagonal covariances,
- $q(\mathbf{z}) = \mathcal{N}(\boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$
- $p(\mathbf{z}) = \mathcal{N}(\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)$

the KL divergence is:
$$ D_{KL}(q \| p) = \frac{1}{2}\left[\log\frac{|\boldsymbol{\Sigma}_p|}{|\boldsymbol{\Sigma}_q|} - d + \text{tr}(\boldsymbol{\Sigma}_p^{-1}\boldsymbol{\Sigma}_q) + (\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)^\top \boldsymbol{\Sigma}_p^{-1}(\boldsymbol{\mu}_p - \boldsymbol{\mu}_q)\right] $$
For $p = \mathcal{N}(\mathbf{0}, \mathbf{I})$, we have $\boldsymbol{\Sigma}_p = \mathbf{I}$, $\boldsymbol{\mu}_p = \mathbf{0}$:
$$ D_{KL} = \frac{1}{2}\left[-\log|\boldsymbol{\Sigma}_q| - d + \text{tr}(\boldsymbol{\Sigma}_q) + \boldsymbol{\mu}_q^\top\boldsymbol{\mu}_q\right] $$
With diagonal $\boldsymbol{\Sigma}_q = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$:
$$ = \frac{1}{2}\left[-\sum_j \log\sigma_j^2 - d + \sum_j \sigma_j^2 + \sum_j \mu_j^2\right] = \frac{1}{2}\sum_{j=1}^d \left(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\right) $$
$\square$
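For reference, Eq. (4) translates directly into a few lines of code; this is a sketch assuming the encoder outputs a log-variance, with illustrative names.

```python
import torch

def kl_to_standard_normal(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Eq. (4): KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var, dim=-1)
```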
3. Conditional Prior Design
3.1 Scenario-Dependent Prior
In our implementation, the prior $p(\mathbf{z} | c)$ is scenario-dependent:
\[p(\mathbf{z} | c) = \mathcal{N}\big(\boldsymbol{\mu}_c, \text{diag}(\boldsymbol{\sigma}_c^2)\big) \tag{5}\]
where $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c$ are learnable parameters for each scenario $c$.
Rationale: Different economic scenarios have distinct latent space characteristics:
- Baseline ($c=0$): Centered prior, moderate variance
- Adverse ($c=1$): Shifted mean toward stress, higher variance
- Severely Adverse ($c=2$): Larger shift, even higher variance
- Stagflation ($c=3$): Different direction of shift (high inflation, low growth)
3.2 Modified KL Term
With a non-standard prior, the KL divergence becomes:
\[D_{KL}(q_\phi(\mathbf{z}|\mathbf{x},c) \| p(\mathbf{z}|c)) = \frac{1}{2}\sum_{j=1}^d \left[\frac{\sigma_{\phi,j}^2 + (\mu_{\phi,j} - \mu_{c,j})^2}{\sigma_{c,j}^2} - 1 - \log\frac{\sigma_{\phi,j}^2}{\sigma_{c,j}^2}\right] \tag{6}\]
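A sketch of Eq. (6), assuming log-variance parameterizations for both the posterior and the scenario prior (the function and argument names are illustrative):

```python
import torch

def kl_to_scenario_prior(mu_phi: torch.Tensor, log_var_phi: torch.Tensor,
                         mu_c: torch.Tensor, log_var_c: torch.Tensor) -> torch.Tensor:
    """Eq. (6): KL(N(mu_phi, diag(sigma_phi^2)) || N(mu_c, diag(sigma_c^2)))."""
    log_ratio = log_var_phi - log_var_c                    # log(sigma_phi^2 / sigma_c^2)
    mean_term = (mu_phi - mu_c).pow(2) / log_var_c.exp()   # (mu_phi - mu_c)^2 / sigma_c^2
    return 0.5 * torch.sum(log_ratio.exp() + mean_term - 1.0 - log_ratio, dim=-1)
```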
4. Reconstruction Loss
4.1 Gaussian Decoder (MSE Loss)
Assuming $p_\theta(\mathbf{x} | \mathbf{z}, c) = \mathcal{N}(\boldsymbol{\mu}_\theta(\mathbf{z}, c), \sigma_{\text{dec}}^2 \mathbf{I})$:
\[\log p_\theta(\mathbf{x} | \mathbf{z}, c) = -\frac{1}{2\sigma_{\text{dec}}^2}\sum_{t=1}^T \sum_{d=1}^D (x_{t,d} - \mu_{\theta,t,d})^2 - \frac{TD}{2}\log(2\pi\sigma_{\text{dec}}^2) \tag{7}\]
Maximizing this is equivalent to minimizing MSE loss:
\[\mathcal{L}_{\text{recon}} = \frac{1}{TD}\sum_{t=1}^T \sum_{d=1}^D (x_{t,d} - \hat{x}_{t,d})^2 \tag{8}\]
4.2 Alternative: Negative Log-Likelihood
For heteroscedastic noise, we can learn variance:
\[p_\theta(\mathbf{x} | \mathbf{z}, c) = \mathcal{N}(\boldsymbol{\mu}_\theta(\mathbf{z}, c), \text{diag}(\boldsymbol{\sigma}_\theta^2(\mathbf{z}, c)))\]
\[\mathcal{L}_{\text{NLL}} = \frac{1}{2}\sum_{t,d}\left[\log\sigma_{\theta,t,d}^2 + \frac{(x_{t,d} - \mu_{\theta,t,d})^2}{\sigma_{\theta,t,d}^2}\right] \tag{9}\]
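A sketch of the heteroscedastic loss in Eq. (9); the decoder is assumed to output a per-element mean and log-variance, and the names are illustrative.

```python
import torch

def gaussian_nll(x: torch.Tensor, mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Eq. (9): 0.5 * sum over (t, d) of [log sigma^2 + (x - mu)^2 / sigma^2]."""
    return 0.5 * torch.sum(log_var + (x - mu).pow(2) / log_var.exp())
```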
5. Beta-VAE and KL Annealing
5.1 Beta-VAE Objective
Definition (Beta-VAE Loss)
\[\mathcal{L}_\beta = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z},c)] - \beta \cdot D_{KL}(q_\phi(\mathbf{z}|\mathbf{x},c) \| p(\mathbf{z}|c)) \tag{10}\]
- $\beta < 1$: Emphasizes reconstruction; the latent space is less regularized
- $\beta = 1$: Standard VAE (ELBO)
- $\beta > 1$: Emphasizes disentanglement; may hurt reconstruction and increases the risk of posterior collapse
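Written as a quantity to minimize, Eq. (10) is simply the reconstruction loss plus a $\beta$-weighted KL term; the sketch below assumes `mse` and `kl` are computed as in Eqs. (8) and (6).

```python
import torch

def beta_vae_loss(mse: torch.Tensor, kl: torch.Tensor, beta: float) -> torch.Tensor:
    """Negative of L_beta (up to the Gaussian decoder's constant factors)."""
    return mse + beta * kl
```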
5.2 KL Annealing Schedule
Algorithm 1: Cyclical KL Annealing
```
Input: Total epochs E, cycle length C, warmup fraction r
For epoch e in 1..E:
    cycle_position = (e % C) / C
    if cycle_position < r:
        beta = cycle_position / r
    else:
        beta = 1.0
    Train with beta
```
Schedule Types:
- Linear Warmup: \(\beta_t = \min\left(1, \frac{t}{T_{\text{warmup}}}\right) \tag{11}\)
- Cyclical Annealing (used in our implementation): \(\beta_t = \min\left(1, \frac{t \bmod C}{r \cdot C}\right) \tag{12}\)
- Sigmoid Annealing: \(\beta_t = \sigma\left(\frac{t - T_{\text{mid}}}{T_{\text{scale}}}\right) \tag{13}\)
5.3 Posterior Collapse Prevention
Definition (Posterior Collapse): When $q_\phi(\mathbf{z} | \mathbf{x}, c) \approx p(\mathbf{z} | c)$ for all $\mathbf{x}$, meaning the encoder ignores the input.
Theorem 2 (Posterior Collapse Condition)
*Posterior collapse occurs when the decoder is powerful enough to model $p(\mathbf{x} | c)$ without latent information, i.e., when:*
\[p_\theta(\mathbf{x} | \mathbf{z}, c) \approx p_\theta(\mathbf{x} | c) \quad \text{for all } \mathbf{z} \tag{14}\]
Mitigation Strategies:
- KL annealing (start with $\beta \ll 1$)
- Free bits: $D_{KL} \leftarrow \max(\lambda, D_{KL})$ (see the sketch after this list)
- Architectural constraints: avoid decoders powerful enough to bypass $\mathbf{z}$, or add skip connections from $\mathbf{z}$ into the decoder layers
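The free-bits floor from the list above amounts to a single clamp; this sketch assumes the KL is available per latent dimension, which is how free bits is commonly applied (names are illustrative).

```python
import torch

def free_bits_kl(kl_per_dim: torch.Tensor, lam: float) -> torch.Tensor:
    """Floor each dimension's KL at lambda nats before summing, so the optimizer
    gains nothing from pushing any dimension's KL below lambda."""
    return torch.clamp(kl_per_dim, min=lam).sum(dim=-1)
```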
6. LSTM Encoder Architecture
6.1 Sequence Encoding
The encoder maps $\mathbf{x} \in \mathbb{R}^{T \times D}$ to latent parameters:
\[\mathbf{h}_t = \text{LSTM}(\mathbf{x}_t, \mathbf{h}_{t-1}) \tag{15}\]
\[\boldsymbol{\mu}_\phi = W_\mu \mathbf{h}_T + \mathbf{b}_\mu, \quad \log\boldsymbol{\sigma}_\phi^2 = W_\sigma \mathbf{h}_T + \mathbf{b}_\sigma \tag{16}\]
6.2 Condition Embedding
The scenario condition is embedded and concatenated:
\[\mathbf{e}_c = \text{Embedding}(c) \in \mathbb{R}^{d_c} \tag{17}\]
\[\tilde{\mathbf{x}}_t = [\mathbf{x}_t; \mathbf{e}_c] \in \mathbb{R}^{D + d_c} \tag{18}\]
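A compact PyTorch sketch of Eqs. (15)-(18): the scenario embedding is concatenated to every time step, an LSTM encodes the sequence, and the final hidden state is mapped to the posterior parameters. Class and layer names, sizes, and defaults are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class ConditionalEncoder(nn.Module):
    def __init__(self, n_vars: int, n_conditions: int, cond_dim: int = 8,
                 hidden_dim: int = 64, latent_dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(n_conditions, cond_dim)            # e_c, Eq. (17)
        self.lstm = nn.LSTM(n_vars + cond_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)                # Eq. (16)
        self.to_log_var = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x: torch.Tensor, c: torch.Tensor):
        # x: (batch, T, D), c: (batch,) integer scenario labels
        e_c = self.embed(c).unsqueeze(1).expand(-1, x.size(1), -1)    # repeat over time
        h, _ = self.lstm(torch.cat([x, e_c], dim=-1))                 # Eqs. (18), (15)
        h_T = h[:, -1]                                                # final hidden state
        return self.to_mu(h_T), self.to_log_var(h_T)
```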
7. Decoder Architecture
7.1 Latent to Sequence
The decoder generates sequences from latent codes:
\[\mathbf{s}_0 = W_{\text{init}}[\mathbf{z}; \mathbf{e}_c] + \mathbf{b}_{\text{init}} \tag{19}\]
\[\mathbf{s}_t = \text{LSTM}(\mathbf{s}_{t-1}, \mathbf{s}_{t-1}) \tag{20}\]
\[\hat{\mathbf{x}}_t = W_{\text{out}}\mathbf{s}_t + \mathbf{b}_{\text{out}} \tag{21}\]
7.2 Trend Component
To capture long-term dynamics, we add a deterministic trend:
\[\mathbf{trend}_t = \mathbf{a}_c \cdot t + \mathbf{b}_c \tag{22}\]
\[\hat{\mathbf{x}}_t^{\text{final}} = \hat{\mathbf{x}}_t + \mathbf{trend}_t \tag{23}\]
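A corresponding sketch of Eqs. (19)-(23). The latent code and condition embedding initialize the recurrent state, an LSTM cell unrolls the sequence (here fed its own previous state, mirroring Eq. (20)), and a learnable per-scenario linear trend is added. All names, sizes, and the way per-scenario trend parameters are stored are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    def __init__(self, n_vars: int, n_conditions: int, cond_dim: int = 8,
                 hidden_dim: int = 64, latent_dim: int = 16, seq_len: int = 12):
        super().__init__()
        self.seq_len = seq_len
        self.embed = nn.Embedding(n_conditions, cond_dim)
        self.init = nn.Linear(latent_dim + cond_dim, hidden_dim)      # Eq. (19)
        self.cell = nn.LSTMCell(hidden_dim, hidden_dim)               # Eq. (20)
        self.out = nn.Linear(hidden_dim, n_vars)                      # Eq. (21)
        self.trend_a = nn.Embedding(n_conditions, n_vars)             # a_c, Eq. (22)
        self.trend_b = nn.Embedding(n_conditions, n_vars)             # b_c

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        e_c = self.embed(c)
        s = self.init(torch.cat([z, e_c], dim=-1))                    # s_0
        cell_state = torch.zeros_like(s)
        outputs = []
        for _ in range(self.seq_len):
            s, cell_state = self.cell(s, (s, cell_state))             # unroll, Eq. (20)
            outputs.append(self.out(s))                               # x_hat_t, Eq. (21)
        x_hat = torch.stack(outputs, dim=1)                           # (batch, T, D)
        t_idx = torch.arange(1, self.seq_len + 1, device=z.device).view(1, -1, 1)
        trend = self.trend_a(c).unsqueeze(1) * t_idx + self.trend_b(c).unsqueeze(1)
        return x_hat + trend                                          # Eq. (23)
```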
8. Training Objective
8.1 Final Loss Function
\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{recon}} + \beta \cdot \mathcal{L}_{KL} + \lambda_{\text{corr}} \cdot \mathcal{L}_{\text{corr}} \tag{24}\]
where:
- $\mathcal{L}_{\text{recon}}$: MSE reconstruction loss (Eq. 8)
- $\mathcal{L}_{KL}$: KL divergence (Eq. 6)
- $\mathcal{L}_{\text{corr}}$: Correlation regularization
8.2 Correlation Regularization
To preserve cross-variable dependencies:
\[\mathcal{L}_{\text{corr}} = \|\mathbf{R}_{\text{true}} - \mathbf{R}_{\text{gen}}\|_F^2 \tag{25}\]
where $\mathbf{R}_{\text{true}}, \mathbf{R}_{\text{gen}} \in \mathbb{R}^{D \times D}$ are the cross-variable correlation matrices of the real and generated series, respectively.
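A sketch of Eq. (25), pooling the batch and time dimensions before correlating the $D$ variables; `torch.corrcoef` expects variables in rows, and the function and argument names are illustrative.

```python
import torch

def correlation_loss(x_true: torch.Tensor, x_gen: torch.Tensor) -> torch.Tensor:
    """Eq. (25): squared Frobenius distance between D x D correlation matrices."""
    def corr(x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.size(-1))        # (batch * T, D)
        return torch.corrcoef(flat.T)           # (D, D)
    return torch.linalg.matrix_norm(corr(x_true) - corr(x_gen)) ** 2
```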
9. Theoretical Guarantees
Theorem 3 (Consistency of VAE)
Under mild regularity conditions, as sample size $n \to \infty$:
\[\hat{\theta}_n \xrightarrow{p} \theta^* \quad \text{and} \quad \hat{\phi}_n \xrightarrow{p} \phi^* \tag{26}\]
where $(\theta^*, \phi^*)$ maximizes the population ELBO.
Corollary 3.1: *The generative model $p_{\hat{\theta}}(\mathbf{x} | c)$ converges to the true data distribution as $n \to \infty$, provided the model class is correctly specified.*
10. Numerical Example
Consider a simple 1D case with:
- Prior: $p(z) = \mathcal{N}(0, 1)$
- Encoder output: $\mu_\phi = 0.5$, $\sigma_\phi = 0.8$
KL Divergence Calculation:
\[D_{KL} = \frac{1}{2}(0.8^2 + 0.5^2 - 1 - \log 0.8^2)\]
\[= \frac{1}{2}(0.64 + 0.25 - 1 - (-0.446))\]
\[= \frac{1}{2}(0.336) = 0.168\]
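The arithmetic can be reproduced in a couple of lines (illustrative script):

```python
import math

mu, sigma = 0.5, 0.8
kl = 0.5 * (sigma**2 + mu**2 - 1 - math.log(sigma**2))
print(f"{kl:.3f}")   # 0.168
```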
References
- Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
- Higgins, I., et al. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR.
- Bowman, S. R., et al. (2016). Generating sentences from a continuous space. CoNLL.
- Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep conditional generative models. NeurIPS.
- Fu, H., et al. (2019). Cyclical annealing schedule: A simple approach to mitigating KL vanishing. NAACL.