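Let $x_1, \dots, x_n$ be samples from an unknown distribution $p^*$ on $\mathbb{R}^d$. A generative model specifies a measurable map
$$f_\theta : z \mapsto x = f_\theta(z), \qquad z \sim p_Z,$$
where $p_Z$ is a simple base distribution (e.g., standard Gaussian), and $f_\theta$ is a neural network parameterized by $\theta$.
The standard learning principle is maximum likelihood. If $p_\theta$ denotes the density of $X = f_\theta(Z)$ (when it exists), then we estimate $\theta$ by
$$\hat\theta \in \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i).$$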
However, when $f_\theta$ is implemented via a generic simulator rather than defined through a closed-form expression, computing $p_\theta$ can be difficult. A key tractable case is when $f_\theta$ is an invertible and continuously differentiable map, in which case we can apply the change-of-variables formula.
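Assume $f : \mathbb{R}^d \to \mathbb{R}^d$ is invertible and continuously differentiable, with Jacobian $J_f(z) = \partial f(z) / \partial z$. Let $X = f(Z)$, where $Z \in \mathbb{R}^d$ with density $p_Z$. Then the density of $X$ is
$$p_X(x) = p_Z\big(f^{-1}(x)\big)\,\big|\det J_{f^{-1}}(x)\big|.$$
Here, $J_{f^{-1}}(x)$ is the Jacobian matrix of the inverse mapping $f^{-1}$. The formula has two factors: the first, $p_Z(f^{-1}(x))$, accounts for the change of variable via $z = f^{-1}(x)$, and the second, $|\det J_{f^{-1}}(x)|$, is a scaling factor that accounts for the volume distortion introduced by the mapping.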
Remark 3.1.
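For an invertible function, we have
$$J_{f^{-1}}(x) = J_f(z)^{-1},$$
where $z = f^{-1}(x)$. Hence, we can also write
$$p_X(x) = p_Z(z)\,\big|\det J_f(z)\big|^{-1}, \qquad \log p_X(x) = \log p_Z(z) - \log\big|\det J_f(z)\big|.$$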
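Proof. Recall that a function $p$ is the density of a random variable $X$ if and only if the following holds for all bounded measurable functions $h$:
$$\mathbb{E}[h(X)] = \int h(x)\,p(x)\,dx.$$
We compute $\mathbb{E}[h(X)]$ and express it in integral form to identify the density function of $X$:
$$\mathbb{E}[h(X)] = \mathbb{E}[h(f(Z))] = \int h(f(z))\,p_Z(z)\,dz = \int h(x)\,p_Z\big(f^{-1}(x)\big)\,\big|\det J_{f^{-1}}(x)\big|\,dx.$$
The last step uses the change-of-variables formula in integration: $z = f^{-1}(x)$, $dz = |\det J_{f^{-1}}(x)|\,dx$. ◻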
Example 3.1.
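Assume $Z \sim \mathcal{N}(0, I_d)$ and define the affine transformation
$$f(z) = Az + b,$$
where $X = f(Z)$, with $b \in \mathbb{R}^d$ and $A \in \mathbb{R}^{d \times d}$ an invertible matrix. Then $f^{-1}(x) = A^{-1}(x - b)$, and $J_{f^{-1}}(x) = A^{-1}$. By the change of variables formula,
$$p_X(x) = p_Z\big(A^{-1}(x - b)\big)\,\big|\det A^{-1}\big|.$$
Hence,
$$p_X(x) = \frac{1}{(2\pi)^{d/2}\,|\det A|} \exp\!\Big(-\tfrac{1}{2}\,(x - b)^\top (A A^\top)^{-1} (x - b)\Big).$$
This is exactly the density function of a Gaussian random variable with mean $b$ and covariance $A A^\top$, since $\det(A A^\top)^{1/2} = |\det A|$.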
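As a numerical sanity check of the affine example, the following minimal sketch compares the change-of-variables log-density with the closed-form Gaussian log-density. The function names, the random matrix `A`, and the test point `x` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of N(mean, cov) evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def affine_flow_logpdf(x, A, b):
    """Log-density of X = A Z + b with Z ~ N(0, I), via change of variables."""
    d = x.shape[0]
    z = np.linalg.solve(A, x - b)                 # z = f^{-1}(x)
    log_pz = -0.5 * (d * np.log(2 * np.pi) + z @ z)
    _, logabsdet_A = np.linalg.slogdet(A)         # log|det A|
    return log_pz - logabsdet_A                   # log|det A^{-1}| = -log|det A|

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)     # illustrative invertible matrix
b = rng.normal(size=d)
x = rng.normal(size=d)

lhs = affine_flow_logpdf(x, A, b)                 # change-of-variables value
rhs = gaussian_logpdf(x, b, A @ A.T)              # N(b, A A^T) value
print(lhs, rhs)
```

The two printed values should agree up to floating-point error, which is the content of the example.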
MLE for Invertible Models
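Consequently, the empirical log-likelihood becomes
$$\ell_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \Big[\log p_Z\big(f_\theta^{-1}(x_i)\big) + \log\big|\det J_{f_\theta^{-1}}(x_i)\big|\Big].$$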
To make this MLE computationally feasible in practice, we aim to design the model architecture of $f_\theta$ (which is typically modeled as a neural network in modern generative modeling) such that:
$f_\theta$ is invertible for all $\theta$, and both $f_\theta$ and its inverse $f_\theta^{-1}$ can be computed efficiently.
The determinant of the Jacobian matrix, as well as its derivative, can be evaluated efficiently and stably.
Normalizing Flows
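One general approach to designing tractable, invertible models is to define the map $f$ as a composition of many simple, tractable, and invertible transforms:
$$f = f_K \circ f_{K-1} \circ \cdots \circ f_1,$$
where each $f_k$ is a transformation for which both the inverse and the Jacobian determinant can be computed efficiently.
The Jacobian of the composition can be written explicitly as a matrix product:
$$J_f(z) = J_{f_K}\big(z^{(K-1)}\big)\, J_{f_{K-1}}\big(z^{(K-2)}\big) \cdots J_{f_1}\big(z^{(0)}\big),$$
where $z^{(0)} = z$ and $z^{(k)} = f_k\big(z^{(k-1)}\big)$ for $k = 1, \dots, K$.
Correspondingly, the determinant is a product of individual determinants:
$$\det J_f(z) = \prod_{k=1}^{K} \det J_{f_k}\big(z^{(k-1)}\big).$$
Hence, the log-likelihood is given by
$$\log p_X(x) = \log p_Z(z) - \sum_{k=1}^{K} \log\big|\det J_{f_k}\big(z^{(k-1)}\big)\big|,$$
where $z = f^{-1}(x) = f_1^{-1} \circ \cdots \circ f_K^{-1}(x)$. Therefore, maximum likelihood estimation reduces to evaluating the base density at $z$ and summing the log-determinants of the Jacobians across layers.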
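The layer-wise bookkeeping can be sketched as follows. This is a minimal illustration that assumes every block is an affine map (so the result can be checked against a Gaussian in closed form); the class `AffineLayer` and the function `flow_logpdf` are hypothetical names, and a real flow would use nonlinear blocks.

```python
import numpy as np

class AffineLayer:
    """Illustrative invertible block f_k(u) = A u + b, with log|det J| = log|det A|."""
    def __init__(self, A, b):
        self.A, self.b = A, b
        self.logabsdet = np.linalg.slogdet(A)[1]

    def forward(self, u):
        return self.A @ u + self.b

    def inverse(self, v):
        return np.linalg.solve(self.A, v - self.b)

def flow_logpdf(x, layers):
    """log p_X(x) = log p_Z(z) - sum_k log|det J_{f_k}| for f = f_K o ... o f_1."""
    u, total_logdet = x, 0.0
    for layer in reversed(layers):        # undo f_K first and f_1 last
        u = layer.inverse(u)
        total_logdet += layer.logabsdet
    z = u
    log_pz = -0.5 * (z.size * np.log(2 * np.pi) + z @ z)   # standard Gaussian base
    return log_pz - total_logdet, z

rng = np.random.default_rng(1)
d, K = 4, 3
layers = [AffineLayer(rng.normal(size=(d, d)) + 3.0 * np.eye(d), rng.normal(size=d))
          for _ in range(K)]
x = rng.normal(size=d)
log_px, z = flow_logpdf(x, layers)

# Push z forward again to confirm that the layers really invert each other.
x_rec = z
for layer in layers:
    x_rec = layer.forward(x_rec)
print(log_px, np.max(np.abs(x_rec - x)))
```

Because the log-determinants are accumulated while inverting, one backward sweep through the layers yields both $z$ and the correction term.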
Each can be viewed as a building block. Different methods vary in how these blocks are designed. To ensure flexibility, we require a sufficient number of expressive blocks such that their composition can model complex distributions.
Triangular Maps
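One common design is based on triangular maps. Let $z = (z_1, \dots, z_d) \in \mathbb{R}^d$. A triangular map takes the form
$$T(z) = \big(T_1(z_1),\; T_2(z_1, z_2),\; \dots,\; T_d(z_1, \dots, z_d)\big),$$
where each $T_i$ is invertible in its last argument. The Jacobian is lower triangular:
$$J_T(z) = \begin{pmatrix} \partial_{z_1} T_1 & 0 & \cdots & 0 \\ \partial_{z_1} T_2 & \partial_{z_2} T_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{z_1} T_d & \partial_{z_2} T_d & \cdots & \partial_{z_d} T_d \end{pmatrix},$$
so
$$\det J_T(z) = \prod_{i=1}^{d} \frac{\partial T_i}{\partial z_i}(z_1, \dots, z_i),$$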
and the log-determinant becomes a cheap sum of elementwise terms. This structure underlies masked autoregressive flows and related coupling-based designs.
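To illustrate why only the diagonal matters, the sketch below defines a small hand-written triangular map (a hypothetical example, not one from the notes) and compares the sum of log diagonal partial derivatives with the log-determinant of a finite-difference Jacobian.

```python
import numpy as np

def triangular_map(z):
    """Hypothetical 3-d triangular map: output i depends only on z_1, ..., z_i."""
    t1 = 2.0 * z[0]
    t2 = z[1] * np.exp(np.tanh(z[0])) + z[0] ** 2
    t3 = z[2] * np.exp(0.5 * np.sin(z[0] + z[1])) + z[0] * z[1]
    return np.array([t1, t2, t3])

def logdet_from_diagonal(z):
    """Sum of log(dT_i/dz_i), read off directly from the definition above."""
    return np.log(2.0) + np.tanh(z[0]) + 0.5 * np.sin(z[0] + z[1])

def numerical_jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian, used only as an independent check."""
    cols = [(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(z.size)]
    return np.column_stack(cols)

z = np.array([0.3, -1.2, 0.7])
J = numerical_jacobian(triangular_map, z)
logdet_full = np.linalg.slogdet(J)[1]     # O(d^3) in general
logdet_diag = logdet_from_diagonal(z)     # O(d) thanks to the triangular structure
print(logdet_full, logdet_diag)
```

The entries above the diagonal of `J` vanish, so the full determinant collapses to the product of the diagonal terms.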
Additive Coupling Layers.
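Another design is based on coupling layers. Split $x = (x_1, x_2)$ into two blocks and define a transformation of the form:
$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1),$$
where $F$ and $G$ are arbitrary neural networks.
The transformation is invertible via
$$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2).$$
The Jacobian has the block form
$$\frac{\partial (y_1, y_2)}{\partial (x_1, x_2)} = \begin{pmatrix} I & J_F(x_2) \\ J_G(y_1) & I + J_G(y_1)\,J_F(x_2) \end{pmatrix}.$$
It may not be immediately obvious, but the determinant of this Jacobian is always one. This follows from the block matrix determinant formula:
$$\det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(A)\,\det\big(D - C A^{-1} B\big),$$
which holds when $A$ is invertible. In our case, $A = I$, so
$$\det \frac{\partial (y_1, y_2)}{\partial (x_1, x_2)} = \det(I)\,\det\big(I + J_G J_F - J_G J_F\big) = 1.$$
This design is also known as a reversible residual layer. Its key advantage is that it permits the use of arbitrary functions $F$ and $G$ while maintaining tractable invertibility.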
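The unit-determinant claim can be checked numerically. The sketch below assumes two small stand-in networks (single `tanh` layers with random weights `W_F` and `W_G`, both hypothetical) and equal block sizes; it verifies the inverse and estimates the determinant of the full Jacobian by finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3                                    # size of each block (assumed equal here)
W_F = rng.normal(size=(k, k))            # hypothetical weights for F
W_G = rng.normal(size=(k, k))            # hypothetical weights for G

def F(u):
    return np.tanh(W_F @ u)

def G(u):
    return np.tanh(W_G @ u)

def forward(x):
    x1, x2 = x[:k], x[k:]
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return np.concatenate([y1, y2])

def inverse(y):
    y1, y2 = y[:k], y[k:]
    x2 = y2 - G(y1)                      # undo the second update first
    x1 = y1 - F(x2)
    return np.concatenate([x1, x2])

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, used only as a check."""
    cols = [(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(x.size)]
    return np.column_stack(cols)

x = rng.normal(size=2 * k)
y = forward(x)
x_rec = inverse(y)
det_J = np.linalg.det(numerical_jacobian(forward, x))
print(np.max(np.abs(x_rec - x)), det_J)
```

Neither `F` nor `G` is ever inverted, which is why they can be arbitrary networks.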
Architectures with Efficient Log-Determinant
Coupling Layers (NICE/RealNVP/Glow)
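Partition the input $x = (x_1, x_2)$ (by channels, checkerboard, or masks), and define an affine coupling transform
$$y_1 = x_1, \qquad y_2 = x_2 \odot \exp\big(s(x_1)\big) + t(x_1),$$
with flexible subnetworks $s, t$. The Jacobian is block lower-triangular:
$$\frac{\partial y}{\partial x} = \begin{pmatrix} I & 0 \\ \partial y_2 / \partial x_1 & \operatorname{diag}\big(\exp(s(x_1))\big) \end{pmatrix}, \qquad \log\Big|\det \frac{\partial y}{\partial x}\Big| = \sum_j s_j(x_1),$$
and inversion is closed-form:
$$x_1 = y_1, \qquad x_2 = \big(y_2 - t(y_1)\big) \odot \exp\big(-s(y_1)\big).$$
Stacking multiple layers with alternating masks (and permutations) yields full-dimensional mixing. The additive special case (NICE) is volume-preserving ($s \equiv 0$, so $\log|\det J| = 0$) and extremely stable but less expressive per layer.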
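A minimal sketch of one affine coupling layer follows, assuming linear stand-ins for the subnetworks (the weights `W_s` and `W_t` and the `tanh` bound on the log-scale are illustrative choices). It checks the closed-form inverse and compares the analytic log-determinant with a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2 = 2, 3                                 # sizes of the two blocks
W_s = 0.5 * rng.normal(size=(d2, d1))         # hypothetical scale-network weights
W_t = rng.normal(size=(d2, d1))               # hypothetical shift-network weights

def s(x1):
    return np.tanh(W_s @ x1)                  # bounded log-scale

def t(x1):
    return W_t @ x1

def coupling_forward(x):
    x1, x2 = x[:d1], x[d1:]
    log_scale = s(x1)
    y2 = x2 * np.exp(log_scale) + t(x1)
    return np.concatenate([x1, y2]), log_scale.sum()   # output and log|det J|

def coupling_inverse(y):
    y1, y2 = y[:d1], y[d1:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))        # s and t are only evaluated, never inverted
    return np.concatenate([y1, x2])

x = rng.normal(size=d1 + d2)
y, logdet = coupling_forward(x)
x_rec = coupling_inverse(y)

# Independent check: finite-difference Jacobian of the forward map.
eps = 1e-6
J = np.column_stack([
    (coupling_forward(x + eps * e)[0] - coupling_forward(x - eps * e)[0]) / (2 * eps)
    for e in np.eye(d1 + d2)
])
logdet_numeric = np.linalg.slogdet(J)[1]
print(logdet, logdet_numeric)
```

Setting `W_s` to zero recovers the additive (volume-preserving) case with `logdet == 0`.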
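Invertible $1 \times 1$ Convolution (Glow)
On images, apply a learned invertible matrix $W \in \mathbb{R}^{c \times c}$ (a $1 \times 1$ convolution) to channels at each spatial location:
$$y_{i,j} = W x_{i,j}, \qquad x_{i,j} \in \mathbb{R}^{c}.$$
Then $\log|\det J| = h \cdot w \cdot \log|\det W|$ for an $h \times w$ feature map. Parameterize $W$ via PLU/LU to guarantee invertibility and make $\log|\det W|$ stable. Combined with ActNorm (data-dependent affine normalization) and multi-scale “squeeze/factor-out”, Glow achieves strong likelihoods with fast inversion.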
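The log-determinant of the channel-mixing layer can be sketched as below. The shapes `c, h, w` and the random matrix `W` are illustrative assumptions; the example uses a dense `W` with `np.linalg.slogdet` rather than the LU parameterization, and builds the full block-diagonal Jacobian only to confirm the shortcut.

```python
import numpy as np

rng = np.random.default_rng(4)
c, h, w = 3, 4, 5                                  # channels, height, width (illustrative)
W = rng.normal(size=(c, c)) + 2.0 * np.eye(c)      # hypothetical channel-mixing matrix
x = rng.normal(size=(h, w, c))                     # channels-last feature map

y = x @ W.T                                        # y[i, j] = W @ x[i, j] at every location
sign, logabsdet_W = np.linalg.slogdet(W)
logdet = h * w * logabsdet_W                       # one copy of W per spatial location

x_rec = y @ np.linalg.inv(W).T                     # inverse applies W^{-1} per location

# Check against the explicit (h*w*c) x (h*w*c) block-diagonal Jacobian.
J_full = np.kron(np.eye(h * w), W)
logdet_full = np.linalg.slogdet(J_full)[1]
print(logdet, logdet_full)
```

The shortcut costs one $c \times c$ determinant instead of a determinant over all $h \cdot w \cdot c$ dimensions.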
Autoregressive Flows (MAF/IAF)
Autoregressive parameterization yields a triangular Jacobian. A common MAF forward transform is
$$z_i = \big(x_i - \mu_i(x_{<i})\big)\,\exp\big(-\alpha_i(x_{<i})\big), \qquad i = 1, \dots, d.$$
MAF offers fast density evaluation (one pass), but sampling requires sequential inversion. Inverse autoregressive flows (IAF) swap the roles to make sampling fast (parallel forward of a masked network) at the cost of slower likelihood evaluation. This MAF/IAF duality lets us pick the right trade-off for density estimation versus generation speed.
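The speed asymmetry can be made concrete with a small sketch. It assumes linear conditioners whose strictly lower-triangular weight matrices (`M_mu`, `M_alpha`, both hypothetical) play the role of the autoregressive mask: the density direction is one vectorized pass, while the sampling direction must loop over coordinates.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
# Strictly lower-triangular weights: output i only sees inputs with index < i.
M_mu = np.tril(rng.normal(size=(d, d)), k=-1)
M_alpha = np.tril(0.5 * rng.normal(size=(d, d)), k=-1)

def mu(x):
    return M_mu @ x

def alpha(x):
    return np.tanh(M_alpha @ x)               # bounded log-scale

def maf_x_to_z(x):
    """Density direction: every coordinate is computed in a single pass."""
    a = alpha(x)
    z = (x - mu(x)) * np.exp(-a)
    return z, -a.sum()                        # z and log|det dz/dx|

def maf_z_to_x(z):
    """Sampling direction: x_i needs x_1..x_{i-1}, so d sequential steps."""
    x = np.zeros(d)
    for i in range(d):
        x[i] = z[i] * np.exp(alpha(x)[i]) + mu(x)[i]
    return x

x = rng.normal(size=d)
z, logdet = maf_x_to_z(x)
x_rec = maf_z_to_x(z)
print(np.max(np.abs(x_rec - x)), logdet)
```

An IAF would use the same arithmetic with the roles of the two directions exchanged.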
Monotone Spline Couplings (Neural Spline Flows)
Replacing the affine coordinate-wise map by a monotone, invertible spline (e.g., rational-quadratic) improves expressivity while preserving a closed-form inverse and an exact log-det. This often narrows the gap to more flexible generative families while retaining the computational advantages of couplings.
Likelihood, Dequantization, and Reporting
Exact Likelihood and Gradients
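By the change-of-variables formula, gradients decompose into a base-density term and a log-det term:
$$\nabla_\theta \log p_\theta(x) = \nabla_\theta \log p_Z\big(f_\theta^{-1}(x)\big) + \nabla_\theta \log\big|\det J_{f_\theta^{-1}}(x)\big|.$$
Because coupling/autoregressive layers keep $\log|\det J|$ analytic, flows train with standard first-order optimizers and are generally well-behaved.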
Dequantization for Discrete Pixels
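Images live on a discrete grid. To fit continuous flows, one dequantizes via $\tilde{x} = x + u$, $u \sim \mathcal{U}[0,1)^d$. Writing $P_\theta(x) = \int_{[0,1)^d} p_\theta(x + u)\,du$ for the induced discrete model, Jensen’s inequality shows
$$\log P_\theta(x) = \log \int_{[0,1)^d} p_\theta(x + u)\,du \;\ge\; \int_{[0,1)^d} \log p_\theta(x + u)\,du = \mathbb{E}_{u}\big[\log p_\theta(x + u)\big],$$
so maximizing the RHS maximizes a valid lower bound on the discrete log-likelihood. Variational dequantization further learns a noise distribution $q(u \mid x)$ to tighten the bound.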
Bits-Per-Dimension (bpd)
We report bpd as $-\log_2 p_\theta(x) / d$, where $d$ is the number of dimensions (with a dataset-specific constant for dequantization). This unit normalizes across resolutions and enables fair model comparisons.
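The conversion from a log-likelihood in nats to bpd can be sketched as below. The helper `bits_per_dim` and its `bits_offset` argument are hypothetical names for illustration; the offset stands for the dataset-specific constant, which equals $\log_2 256 = 8$ when 8-bit pixel values are rescaled to $[0, 1)$ before modeling.

```python
import numpy as np

def bits_per_dim(log_px_nats, num_dims, bits_offset=0.0):
    """Convert a natural-log likelihood to bits per dimension.

    bits_offset is a hypothetical name for the dataset-specific constant,
    e.g. 8.0 when 8-bit pixels were rescaled from [0, 256) to [0, 1).
    """
    return -log_px_nats / (num_dims * np.log(2.0)) + bits_offset

d = 32 * 32 * 3                                    # e.g. a 32x32 RGB image

# A uniform density on [0, 256)^d has log p(x) = -d * log(256).
bpd_raw = bits_per_dim(-d * np.log(256.0), d)

# The same model expressed on rescaled pixels in [0, 1)^d has log p(x) = 0,
# and the offset restores the same number.
bpd_rescaled = bits_per_dim(0.0, d, bits_offset=8.0)
print(bpd_raw, bpd_rescaled)
```

Both calls give 8 bits per dimension, the cost of coding 8-bit pixels with no model at all, which is a useful reference point when reading reported numbers.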
Conditional Flows
For conditional densities $p_\theta(x \mid c)$, inject the condition $c$ into $s, t$ (or autoregressive nets) via concatenation, FiLM, or attention. The transform remains invertible for each fixed $c$, so exact conditional likelihoods and fast conditional sampling are retained.
Continuous-Time Flows and Transport View
CNF / Neural ODE
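A continuous-time flow evolves $x(t)$ by an ODE
$$\frac{dx(t)}{dt} = v_\theta\big(x(t), t\big), \qquad x(t_0) = z \sim p_Z,$$
and the instantaneous change-of-variables formula states
$$\frac{d}{dt} \log p_t\big(x(t)\big) = -\nabla \cdot v_\theta\big(x(t), t\big) = -\operatorname{tr}\!\Big(\frac{\partial v_\theta}{\partial x}\big(x(t), t\big)\Big).$$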
The divergence can be stochastically estimated (Hutchinson trace), avoiding explicit Jacobians. CNFs offer fine-grained flexibility but require numerical integration; training and evaluation times thus hinge on solver tolerances.
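The Hutchinson estimator rests on the identity $\operatorname{tr}(A) = \mathbb{E}[\varepsilon^\top A \varepsilon]$ for probes with zero mean and identity covariance. The sketch below assumes a linear vector field `v(x) = M x` (so the exact divergence is `trace(M)`) and uses a finite-difference Jacobian-vector product as a stand-in for the automatic-differentiation products used in practice; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
M = rng.normal(size=(d, d))                    # hypothetical linear dynamics

def v(x):
    return M @ x                               # vector field with Jacobian M

def jvp(x, eps_vec, h=1e-5):
    """Finite-difference Jacobian-vector product (dv/dx) @ eps_vec."""
    return (v(x + h * eps_vec) - v(x - h * eps_vec)) / (2 * h)

def hutchinson_divergence(x, num_probes):
    """Monte Carlo estimate of tr(dv/dx) that never forms the Jacobian."""
    probes = rng.choice([-1.0, 1.0], size=(num_probes, d))   # Rademacher probes
    estimates = [eps_vec @ jvp(x, eps_vec) for eps_vec in probes]
    return np.mean(estimates)

x = rng.normal(size=d)
div_est = hutchinson_divergence(x, num_probes=20000)
div_exact = np.trace(M)
print(div_est, div_exact)
```

Each probe costs one Jacobian-vector product, so the estimate scales with the cost of evaluating `v` rather than with $d^2$; the price is Monte Carlo noise that shrinks with the number of probes.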
Positioning vs. Diffusion/Score Models
Discrete-time flows: exact likelihoods, exact inverses, one-shot sampling; expressivity is governed by layer design. CNFs: flexible dynamics and likelihoods that are exact up to solver error via the instantaneous change-of-variables formula, but at the cost of numerical integration. Diffusion/score models: strong sample quality and simple training, yet likelihoods are inexact (or expensive) and sampling is multi-step. All can be unified under mass transport and continuity equations, differing in parameterization and numerical pathways.
Practical Design and Stability
Stable Parameterizations
Scale control: Bound $s$ (e.g., $\tanh$, clamping) to prevent exploding $\exp(s)$.
Normalization: ActNorm or data-dependent affine initialization improves early stability.
Permutation/mixing: Use invertible conv or channel permutations between couplings.
Multi-scale: Squeeze (space $\to$ channels) and factor-out latents to shorten dependencies and ease optimization.
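The scale-control item above can be sketched in a few lines. The helper `bounded_log_scale` and the bound of `2.0` are illustrative assumptions; the idea is to squash the raw network output so that the multiplicative scale stays within a fixed range while remaining close to the identity for small inputs.

```python
import numpy as np

def bounded_log_scale(raw, bound=2.0):
    """Squash an unconstrained output into [-bound, bound] with a smooth tanh."""
    return bound * np.tanh(raw / bound)

raw = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])     # raw outputs, including outliers
log_s = bounded_log_scale(raw)
scale = np.exp(log_s)                             # stays within [e^-2, e^2]
print(log_s, scale)
```

Hard clamping with `np.clip` gives the same bound but has zero gradient outside the range, which is one reason a smooth squashing function is often preferred.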
Diagnostics
Monitor the distribution of $\log|\det J|$, bpd curves, and intermediate activations. Pathologies (e.g., overly negative $\log|\det J|$ or saturated scales) often pinpoint subnetworks that need regularization or rescaling.
Remark 3.2.
Maximum likelihood is mode-covering: it heavily penalizes under-estimating density on data regions. This complements adversarial (often mode-seeking) training and partly explains empirical differences in sample diversity.
Limitations and Trade-offs
Dimensionality and Discreteness
A bijection demands equal input and output dimension; truly discrete variables necessitate specialized invertible discrete layers or relaxation via dequantization.
Expressivity vs. Efficiency
Couplings/autoregressive layers restrict per-layer transforms to keep $\log|\det J|$ closed-form. Expressivity is then accrued via depth, mixing, and spline nonlinearity, each of which adds cost.
MAF/IAF Speed Asymmetry
Choose MAF when likelihood evaluation dominates (density modeling, anomaly detection); choose IAF when sampling speed is paramount (real-time generation, compression).
Mathematical Underpinnings (Sketches)
Change-of-Variables
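For a diffeomorphism $f$, any integrable $g$ satisfies
$$\int g(x)\,dx = \int g\big(f(z)\big)\,\big|\det J_f(z)\big|\,dz,$$
yielding the change-of-variables formula $\log p_X(x) = \log p_Z(z) - \log|\det J_f(z)|$ by taking $g = p_X \mathbf{1}_B$ for measurable sets $B$ and then logs.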
Instantaneous Formula
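From the continuity equation $\partial_t p_t + \nabla \cdot (p_t v) = 0$, evaluating along characteristics gives $\frac{d}{dt} \log p_t(x(t)) = -\nabla \cdot v(x(t), t)$, and integrating over $t \in [t_0, t_1]$ gives the integrated form of the instantaneous change-of-variables formula, $\log p_{t_1}(x(t_1)) = \log p_{t_0}(x(t_0)) - \int_{t_0}^{t_1} \nabla \cdot v(x(t), t)\,dt$.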
Knothe-Rosenblatt Rearrangement
For sufficiently regular densities $p$ and $q$ on $\mathbb{R}^d$, there exists a monotone triangular map $T$ with $T_{\#} p = q$. Autoregressive and monotone-spline layers approximate such transport maps numerically, supporting the expressivity of triangular-Jacobian flows.
Worked Log-Det Examples
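Affine Coupling $\Rightarrow$ Sum of Scales
With the affine coupling transform $y_1 = x_1$, $y_2 = x_2 \odot \exp(s(x_1)) + t(x_1)$, the Jacobian is block lower-triangular with diagonal $\big(I, \operatorname{diag}(\exp(s(x_1)))\big)$, hence $\log|\det J| = \sum_j s_j(x_1)$ and inversion is elementwise.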
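MAF $\Rightarrow$ Sum of Log-Scales
For the MAF transform $z_i = (x_i - \mu_i(x_{<i})) \exp(-\alpha_i(x_{<i}))$, $\partial z_i / \partial x_i = \exp(-\alpha_i(x_{<i}))$ and $\partial z_i / \partial x_j = 0$ for $j > i$. Thus
$$\log\Big|\det \frac{\partial z}{\partial x}\Big| = -\sum_{i=1}^{d} \alpha_i(x_{<i}).$$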
Invertible $1 \times 1$ Convolution
Treating each spatial location independently, $J$ is block-diagonal with $h \cdot w$ copies of $W$. Therefore $\log|\det J| = h \cdot w \cdot \log|\det W|$, computable efficiently via LU with stable sign handling.
Further Reading (Pointers)
NICE/RealNVP (coupling; tractable log-det), Glow (invertible conv; multi-scale), MAF/IAF (autoregressive duality), Neural Spline Flows (monotone splines), FFJORD/CNF (continuous-time with trace estimators). Each navigates the triangle of invertibility, expressivity, and efficiency differently, and the right choice depends on whether likelihood accuracy, sampling speed, or representational power is the primary design goal.