Module 3: Invertible Models

Invertible Models and Normalizing Flows

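Let $x_1, \dots, x_n$ be samples from an unknown distribution $p^*$ on $\mathbb{R}^d$. A generative model specifies a measurable map

$$x = f_\theta(z), \qquad z \sim p_0,$$

where $p_0$ is a simple base distribution (e.g., standard Gaussian), and $f_\theta$ is a neural network parameterized by $\theta$.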

The standard learning principle is maximum likelihood. If $p_\theta$ denotes the density of $x = f_\theta(z)$ (when it exists), then we estimate $\theta$ by

$$\hat{\theta} = \arg\max_\theta \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i).$$

However, when $f_\theta$ is implemented via a generic simulator rather than defined through a closed-form expression, computing $p_\theta$ can be difficult. A key tractable case is when $f_\theta$ is an invertible and continuously differentiable map, in which case we can apply the change-of-variables formula.

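Assume $f : \mathbb{R}^d \to \mathbb{R}^d$ is invertible and continuously differentiable, with Jacobian $\nabla f$. Let $x = f(z)$, where $z$ is a random variable with density $p_0$. Then the density of $x$ is

$$p(x) = p_0\big(f^{-1}(x)\big)\,\big|\det \nabla f^{-1}(x)\big|.$$

Here, $\nabla f^{-1}(x)$ is the Jacobian matrix of the inverse mapping $f^{-1}$. The formula has two parts: the first term, $p_0(f^{-1}(x))$, accounts for the change of variable via $f^{-1}$, and the second term, $|\det \nabla f^{-1}(x)|$, is a scaling factor that accounts for how the mapping locally stretches or compresses volume.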
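As a minimal sketch of the formula in one dimension, take the illustrative choice $f(z) = e^z$ with a standard Gaussian base, so that $f^{-1}(x) = \log x$ and the scaling factor is $1/x$. The function names below are hypothetical, and the trapezoid rule is only a sanity check that the resulting density integrates to roughly one.

```python
import math

def standard_normal_pdf(z):
    """Base density p_0: standard Gaussian on the real line."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def pushforward_density(x):
    """Density of x = f(z) = exp(z), z ~ N(0, 1), via change of variables.

    f^{-1}(x) = log(x) and d f^{-1} / dx = 1 / x, so
    p(x) = p_0(log x) * |1 / x| for x > 0.
    """
    if x <= 0.0:
        return 0.0
    z = math.log(x)          # f^{-1}(x)
    inv_jacobian = 1.0 / x   # derivative of f^{-1} at x
    return standard_normal_pdf(z) * abs(inv_jacobian)

def trapezoid(fn, lo, hi, n):
    """Plain trapezoid rule, used only as a sanity check."""
    h = (hi - lo) / n
    total = 0.5 * (fn(lo) + fn(hi))
    for k in range(1, n):
        total += fn(lo + k * h)
    return total * h

# Total mass on (0, 100]; the remaining tail mass is negligible.
mass = trapezoid(pushforward_density, 1e-9, 100.0, 200_000)
```

The result is the familiar log-normal density; dropping the `1 / x` factor would leave a function that does not integrate to one.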

Remark 3.1.

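For an invertible function, the inverse function theorem gives

$$\nabla f^{-1}(x) = \big(\nabla f(z)\big)^{-1},$$

where $z = f^{-1}(x)$. Hence, we can also write

$$p(x) = p_0(z)\,\big|\det \nabla f(z)\big|^{-1}, \qquad z = f^{-1}(x).$$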

Proof

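Recall that a function $p$ is the density of a random variable $x$ if and only if the following holds for all bounded measurable functions $h$:

$$\mathbb{E}[h(x)] = \int h(x)\,p(x)\,dx.$$

We compute $\mathbb{E}[h(x)]$ for $x = f(z)$ and express it in integral form to identify the density function of $x$:

$$\mathbb{E}[h(f(z))] = \int h(f(z))\,p_0(z)\,dz = \int h(x)\,p_0\big(f^{-1}(x)\big)\,\big|\det \nabla f^{-1}(x)\big|\,dx.$$

The last step uses the change-of-variables formula in integration with $z = f^{-1}(x)$: $dz = |\det \nabla f^{-1}(x)|\,dx$. ◻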

Example 3.1.

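Assume $z \sim \mathcal{N}(0, I_d)$ and define the affine transformation

$$x = f_\theta(z) = \mu + A z,$$

where $\theta = (\mu, A)$, with $\mu \in \mathbb{R}^d$ and $A \in \mathbb{R}^{d \times d}$ an invertible matrix. Then $f_\theta^{-1}(x) = A^{-1}(x - \mu)$, and $\nabla f_\theta^{-1}(x) = A^{-1}$. By the change of variables formula,

$$p_\theta(x) = p_0\big(A^{-1}(x - \mu)\big)\,\big|\det A^{-1}\big|.$$

Hence,

$$p_\theta(x) = \frac{1}{(2\pi)^{d/2}\,|\det A|} \exp\!\Big(-\tfrac{1}{2}(x - \mu)^\top (A A^\top)^{-1} (x - \mu)\Big).$$

This is exactly the density function of a Gaussian random variable with mean $\mu$ and covariance $\Sigma = A A^\top$, since $\det(\Sigma)^{1/2} = |\det A|$.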
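The example can be checked numerically in two dimensions. The sketch below compares the change-of-variables density with the textbook Gaussian density whose covariance is $A A^\top$; the particular values of `mu`, `A`, and `x` are arbitrary illustrations, and the helper names are hypothetical.

```python
import math

def affine_flow_density(x, mu, A):
    """p(x) for x = mu + A z, z ~ N(0, I_2), using p_0(f^{-1}(x)) |det A^{-1}|."""
    (a, b), (c, d) = A
    det_A = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    A_inv = [[d / det_A, -b / det_A], [-c / det_A, a / det_A]]
    r = [x[0] - mu[0], x[1] - mu[1]]
    z = [A_inv[0][0] * r[0] + A_inv[0][1] * r[1],
         A_inv[1][0] * r[0] + A_inv[1][1] * r[1]]
    base = math.exp(-0.5 * (z[0] ** 2 + z[1] ** 2)) / (2.0 * math.pi)
    return base / abs(det_A)

def gaussian_density(x, mu, Sigma):
    """Textbook N(mu, Sigma) density in two dimensions."""
    (s11, s12), (s21, s22) = Sigma
    det_S = s11 * s22 - s12 * s21
    r = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form r^T Sigma^{-1} r with the 2x2 inverse written out.
    quad = (s22 * r[0] ** 2 - (s12 + s21) * r[0] * r[1] + s11 * r[1] ** 2) / det_S
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det_S))

mu = [1.0, -2.0]
A = [[2.0, 0.5], [-1.0, 1.5]]
# Sigma = A A^T
Sigma = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
x = [0.3, 0.7]

via_flow = affine_flow_density(x, mu, A)
via_gaussian = gaussian_density(x, mu, Sigma)
```

The two values agree up to floating-point rounding.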

MLE for Invertible Models

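By the change-of-variables formula, the empirical log-likelihood becomes

$$\hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \Big[ \log p_0\big(f_\theta^{-1}(x_i)\big) + \log \big|\det \nabla f_\theta^{-1}(x_i)\big| \Big].$$

To make this MLE computationally feasible in practice, we aim to design the architecture of $f_\theta$ (typically a neural network in modern generative modeling) such that:

$f_\theta$ is invertible for all $\theta$, and both $f_\theta$ and its inverse $f_\theta^{-1}$ can be computed efficiently.
The determinant of the Jacobian matrix, as well as its gradient with respect to $\theta$, can be evaluated efficiently and stably.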
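To make the objective concrete, here is a small sketch for the simplest invertible model, a one-dimensional affine map with parameters `mu` and `log_sigma` (an illustrative parameterization, not one prescribed by the text). For this model the maximizer is known in closed form, which gives a convenient check.

```python
import math
import random

def avg_log_likelihood(data, mu, log_sigma):
    """Empirical log-likelihood of the 1-D flow x = mu + exp(log_sigma) * z."""
    sigma = math.exp(log_sigma)
    total = 0.0
    for x in data:
        z = (x - mu) / sigma                                      # f_theta^{-1}(x)
        log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)   # log p_0(z)
        log_det_inv = -log_sigma                                  # log |d f^{-1} / dx|
        total += log_base + log_det_inv
    return total / len(data)

# Synthetic data for illustration only.
rng = random.Random(0)
data = [3.0 + 2.0 * rng.gauss(0.0, 1.0) for _ in range(2000)]

# Closed-form maximizer for this model: sample mean and (biased) sample std.
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))
best = avg_log_likelihood(data, mu_hat, math.log(sigma_hat))
```

Perturbing either parameter away from `mu_hat` or `sigma_hat` lowers the objective. For a neural $f_\theta$ the same quantity would be maximized by gradient ascent instead.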

Normalizing Flows

One general approach to designing tractable, invertible models is to define the map as a composition of many simple, tractable, and invertible transforms:

$$f = f_K \circ f_{K-1} \circ \cdots \circ f_1,$$

where each $f_k$ is a transformation for which both the inverse and the Jacobian determinant can be computed efficiently.

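By the chain rule, the Jacobian of the composition $f = f_K \circ \cdots \circ f_1$ can be written explicitly as a matrix product:

$$\nabla f(z) = \nabla f_K(z_{K-1})\,\nabla f_{K-1}(z_{K-2}) \cdots \nabla f_1(z_0),$$

where $z_0 = z$ and $z_k = f_k(z_{k-1})$ for $k = 1, \dots, K$.

Correspondingly, the determinant is a product of individual determinants:

$$\det \nabla f(z) = \prod_{k=1}^K \det \nabla f_k(z_{k-1}).$$

Hence, the log-likelihood is given by

$$\log p(x) = \log p_0(z_0) - \sum_{k=1}^K \log \big|\det \nabla f_k(z_{k-1})\big|,$$

where $z_0 = f^{-1}(x)$ and $z_K = x$. Therefore, maximum likelihood estimation reduces to evaluating the base density at $z_0$ and summing the log-determinants of the Jacobians across layers.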
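The layer-wise bookkeeping can be sketched with scalar layers. The classes `Affine` and `Sinh` and the function `log_prob` are hypothetical names chosen for illustration; each layer exposes a forward map, an inverse, and the log of the absolute derivative evaluated at its input.

```python
import math

class Affine:
    """x = a * z + b with a != 0."""
    def __init__(self, a, b):
        self.a, self.b = a, b
    def forward(self, z):
        return self.a * z + self.b
    def inverse(self, x):
        return (x - self.b) / self.a
    def log_abs_det(self, z):
        return math.log(abs(self.a))

class Sinh:
    """x = sinh(z); the inverse is asinh and the derivative cosh(z) is positive."""
    def forward(self, z):
        return math.sinh(z)
    def inverse(self, x):
        return math.asinh(x)
    def log_abs_det(self, z):
        return math.log(math.cosh(z))

def log_prob(x, layers):
    """log p(x) = log p_0(z_0) - sum over layers of log|derivative at the layer input|."""
    total_log_det = 0.0
    for layer in reversed(layers):     # undo the last layer first
        z_prev = layer.inverse(x)      # input of this layer
        total_log_det += layer.log_abs_det(z_prev)
        x = z_prev
    z0 = x
    log_base = -0.5 * z0 * z0 - 0.5 * math.log(2.0 * math.pi)
    return log_base - total_log_det

layers = [Affine(0.5, 1.0), Sinh(), Affine(2.0, -0.3)]   # applied left to right
value = log_prob(1.7, layers)
```

Each layer contributes one inverse evaluation and one scalar log-determinant, so the cost of a likelihood evaluation grows linearly with the number of layers.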

Each $f_k$ can be viewed as a building block. Different methods vary in how these blocks are designed. For the composition to model complex distributions, we need enough blocks, and each block must be sufficiently expressive.

Triangular Maps

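One common design is based on triangular maps. Let $z = (z_1, \dots, z_d)$. A triangular map takes the form

$$f(z) = \big(\tau_1(z_1),\ \tau_2(z_1, z_2),\ \dots,\ \tau_d(z_1, \dots, z_d)\big),$$

where each $\tau_i$ is invertible in its last argument. The Jacobian is lower triangular:

$$\nabla f(z) = \begin{pmatrix} \partial_{z_1}\tau_1 & 0 & \cdots & 0 \\ \partial_{z_1}\tau_2 & \partial_{z_2}\tau_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \partial_{z_1}\tau_d & \partial_{z_2}\tau_d & \cdots & \partial_{z_d}\tau_d \end{pmatrix},$$

and the log-determinant becomes a cheap sum of elementwise terms, $\log|\det \nabla f(z)| = \sum_{i=1}^d \log|\partial_{z_i}\tau_i(z_1, \dots, z_i)|$. This structure underlies masked autoregressive flows and related coupling-based designs.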
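A two-dimensional sketch shows why only the diagonal matters. The particular map below is an arbitrary illustration: the first output depends on the first input only, and the second output is strictly increasing in the second input. The finite-difference Jacobian is included purely as a reference check.

```python
import math

def triangular_map(z):
    """A 2-D triangular map: x_1 depends on z_1 only, x_2 on (z_1, z_2)."""
    z1, z2 = z
    x1 = 2.0 * z1 + 1.0
    x2 = math.exp(math.tanh(z1)) * z2 + z1 ** 2   # strictly increasing in z2
    return [x1, x2]

def log_abs_det_diagonal(z):
    """Sum of the logs of the diagonal partial derivatives."""
    z1, _ = z
    # d x1 / d z1 = 2 and d x2 / d z2 = exp(tanh(z1)).
    return math.log(2.0) + math.tanh(z1)

def log_abs_det_finite_diff(z, h=1e-6):
    """Reference value from a full finite-difference Jacobian (check only)."""
    cols = []
    for j in range(2):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        fp, fm = triangular_map(zp), triangular_map(zm)
        cols.append([(fp[i] - fm[i]) / (2.0 * h) for i in range(2)])
    # cols[j][i] holds d x_i / d z_j.
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z = [0.4, -1.3]
cheap = log_abs_det_diagonal(z)
reference = log_abs_det_finite_diff(z)
```

The off-diagonal term `z1 ** 2` changes the map but not its log-determinant, which is why the conditioner on earlier coordinates can be arbitrarily complex.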

Additive Coupling Layers.

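Another design is based on coupling layers. Split $z = (z_a, z_b)$ with $z_a \in \mathbb{R}^{d_a}$, $z_b \in \mathbb{R}^{d_b}$, and define a transformation $x = (x_a, x_b)$ of the form:

$$x_a = z_a + F(z_b), \qquad x_b = z_b + G(x_a),$$

where $F : \mathbb{R}^{d_b} \to \mathbb{R}^{d_a}$ and $G : \mathbb{R}^{d_a} \to \mathbb{R}^{d_b}$ are arbitrary neural networks.

The transformation is invertible via

$$z_b = x_b - G(x_a), \qquad z_a = x_a - F(z_b).$$

The Jacobian has the block form

$$\frac{\partial x}{\partial z} = \begin{pmatrix} I & \nabla F(z_b) \\ \nabla G(x_a) & I + \nabla G(x_a)\,\nabla F(z_b) \end{pmatrix}.$$

It may not be immediately obvious, but the determinant of this Jacobian is always one. This follows from the block matrix determinant formula:

$$\det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(A)\,\det\big(D - C A^{-1} B\big),$$

which holds when $A$ is invertible. In our case, $A = I$, so

$$\det \frac{\partial x}{\partial z} = \det\big(I + \nabla G\,\nabla F - \nabla G\,\nabla F\big) = \det(I) = 1.$$

This design is also known as a reversible residual layer. Its key advantage is that it permits the use of arbitrary functions $F$ and $G$, which need not be invertible themselves, while maintaining tractable invertibility.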
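A scalar sketch of the layer, with each half of the input being a single number. The two update functions below are arbitrary, non-invertible stand-ins for neural networks (hypothetical choices); the point is that the round trip is exact and the Jacobian determinant is one regardless.

```python
import math

def F(u):
    """Stand-in for the first arbitrary network (hypothetical choice)."""
    return math.tanh(3.0 * u) + u ** 3

def G(u):
    """Stand-in for the second arbitrary network; note it is not monotone."""
    return math.sin(2.0 * u) - 0.5 * u

def forward(za, zb):
    xa = za + F(zb)
    xb = zb + G(xa)
    return xa, xb

def inverse(xa, xb):
    zb = xb - G(xa)     # undo the second update first
    za = xa - F(zb)
    return za, zb

def jacobian_det_finite_diff(za, zb, h=1e-6):
    """Determinant of d(xa, xb)/d(za, zb) by central differences (check only)."""
    col_a = [(p - m) / (2 * h)
             for p, m in zip(forward(za + h, zb), forward(za - h, zb))]
    col_b = [(p - m) / (2 * h)
             for p, m in zip(forward(za, zb + h), forward(za, zb - h))]
    # col_a = (dxa/dza, dxb/dza), col_b = (dxa/dzb, dxb/dzb)
    return col_a[0] * col_b[1] - col_b[0] * col_a[1]

za, zb = 0.7, -0.2
xa, xb = forward(za, zb)
recovered = inverse(xa, xb)
det = jacobian_det_finite_diff(za, zb)
```

Because the determinant is identically one, the layer contributes nothing to the log-likelihood sum; all density shaping comes from how it moves points.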

Architectures with Efficient Log-Determinant

Coupling Layers (NICE/RealNVP/Glow)

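Partition the input $z = (z_a, z_b)$ (by channels, checkerboard, or masks), and define an affine coupling transform

$$x_a = z_a, \qquad x_b = z_b \odot \exp\big(s(z_a)\big) + t(z_a),$$

with flexible subnetworks $s, t$. The Jacobian is block lower-triangular:

$$\frac{\partial x}{\partial z} = \begin{pmatrix} I & 0 \\ \partial x_b / \partial z_a & \operatorname{diag}\big(\exp(s(z_a))\big) \end{pmatrix}, \qquad \log\Big|\det \frac{\partial x}{\partial z}\Big| = \sum_j s_j(z_a),$$

and inversion is closed-form:

$$z_a = x_a, \qquad z_b = \big(x_b - t(x_a)\big) \odot \exp\big(-s(x_a)\big).$$

Stacking multiple layers with alternating masks (and permutations) yields full-dimensional mixing. The additive special case (NICE) is volume-preserving ($s \equiv 0$, so the Jacobian determinant equals one) and extremely stable but less expressive per layer.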
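An affine coupling layer can be sketched on plain Python lists. The functions `s_net` and `t_net` are hypothetical stand-ins for the scale and shift subnetworks, and the sketch assumes an even-length input split into two equal halves.

```python
import math

def s_net(z_a):
    """Hypothetical log-scale network; tanh keeps the output bounded."""
    total = sum(z_a)
    return [math.tanh(total + k) for k in range(len(z_a))]

def t_net(z_a):
    """Hypothetical shift network."""
    return [0.5 * v * v - 1.0 for v in z_a]

def coupling_forward(z):
    """Return (x, log|det J|): first half copied, second half scaled and shifted."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    s, t = s_net(z_a), t_net(z_a)
    x_b = [zb * math.exp(si) + ti for zb, si, ti in zip(z_b, s, t)]
    return z_a + x_b, sum(s)

def coupling_inverse(x):
    """Closed-form inverse; s and t are recomputed from the untouched half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    s, t = s_net(x_a), t_net(x_a)
    z_b = [(xb - ti) * math.exp(-si) for xb, si, ti in zip(x_b, s, t)]
    return x_a + z_b

z = [0.3, -1.2, 0.8, 2.0]
x, log_det = coupling_forward(z)
z_back = coupling_inverse(x)
```

Neither subnetwork is ever inverted, which is what allows them to be arbitrary. A second layer that swaps the roles of the two halves would transform the coordinates left untouched here.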

Invertible Convolution (Glow)

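On images with $c$ channels and spatial size $h \times w$, apply a learned invertible matrix $W \in \mathbb{R}^{c \times c}$ (a $1 \times 1$ convolution) to the channels at each spatial location $(i, j)$:

$$y_{i,j} = W x_{i,j}.$$

Then $\log|\det J| = h \cdot w \cdot \log|\det W|$. Parameterize $W$ via a PLU/LU decomposition to guarantee invertibility and make $\log|\det W|$ cheap and stable to compute. Combined with ActNorm (data-dependent affine normalization) and multi-scale “squeeze/factor-out”, Glow achieves strong likelihoods with fast inversion.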
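The log-determinant of a small channel-mixing matrix can be read off an LU factorization. The sketch below uses plain Gaussian elimination with partial pivoting as a numerical illustration; it is not Glow's learned PLU parameterization, and the function names are hypothetical.

```python
import math

def log_abs_det_lu(W):
    """log|det W| and sign(det W) via elimination with partial pivoting.

    Raises ValueError if W is numerically singular.
    """
    n = len(W)
    U = [row[:] for row in W]      # work on a copy
    sign = 1.0
    log_abs = 0.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(U[r][k]))
        if abs(U[pivot][k]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        if pivot != k:
            U[k], U[pivot] = U[pivot], U[k]
            sign = -sign           # a row swap flips the determinant's sign
        for r in range(k + 1, n):
            factor = U[r][k] / U[k][k]
            for c in range(k, n):
                U[r][c] -= factor * U[k][c]
        log_abs += math.log(abs(U[k][k]))
        if U[k][k] < 0:
            sign = -sign
    return log_abs, sign

def conv1x1_log_det(W, height, width):
    """Total log|det J| when W acts on the channels at each of height*width locations."""
    log_abs, _ = log_abs_det_lu(W)
    return height * width * log_abs

W = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]   # det = 18
total = conv1x1_log_det(W, 4, 5)
```

Summing logs of pivot magnitudes avoids forming the determinant itself, which can overflow or underflow for larger channel counts.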

Autoregressive Flows (MAF/IAF)

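Autoregressive parameterization yields a triangular Jacobian with a nonzero diagonal. A common MAF forward transform, written in the density-evaluation direction $x \mapsto z$, is

$$z_i = \big(x_i - \mu_i(x_{<i})\big)\,\exp\big(-\alpha_i(x_{<i})\big), \qquad i = 1, \dots, d.$$

MAF offers fast density evaluation (one pass, since every $\mu_i$ and $\alpha_i$ depends only on the observed $x$), but sampling requires sequential inversion, because $x_i$ cannot be computed before $x_{<i}$. Inverse autoregressive flows (IAF) swap the roles to make sampling fast (one parallel forward pass of a masked network) at the cost of slower likelihood evaluation. This MAF/IAF duality lets us pick the right trade-off for density estimation versus generation speed.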
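The asymmetry between the two directions can be seen in a small sketch. The function `conditioner` is a hypothetical stand-in for a masked network that returns a shift and a log-scale from the preceding coordinates; the Python loops only mimic what such a network would compute.

```python
import math

def conditioner(prefix):
    """Hypothetical autoregressive conditioner: (shift, log-scale) from earlier coordinates."""
    h = sum(prefix)
    return 0.5 * h, math.tanh(h)

def maf_normalize(x):
    """Density direction x -> z. Every term depends only on the given x,
    so a masked network could evaluate all coordinates in one pass."""
    z, log_det = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i])
        z.append((x[i] - mu) * math.exp(-alpha))
        log_det -= alpha            # d z_i / d x_i = exp(-alpha_i)
    return z, log_det

def maf_sample(z):
    """Sampling direction z -> x. Coordinate i needs the earlier outputs,
    so this loop is inherently sequential."""
    x = []
    for i in range(len(z)):
        mu, alpha = conditioner(x)  # uses the coordinates generated so far
        x.append(z[i] * math.exp(alpha) + mu)
    return x

def maf_log_prob(x):
    """Standard Gaussian base density plus the log-determinant of x -> z."""
    z, log_det = maf_normalize(x)
    log_base = sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)
    return log_base + log_det

z = [0.2, -1.0, 0.5]
x = maf_sample(z)
z_back, log_det = maf_normalize(x)
```

An IAF would instead condition on the earlier latent coordinates, which makes `maf_sample`-style generation parallel and the density direction sequential.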

Monotone Spline Couplings (Neural Spline Flows)

Replacing the affine coordinate-wise map by a monotone, invertible spline (e.g., rational-quadratic) improves expressivity while preserving a closed-form inverse and an exact log-determinant. This often narrows the gap to more flexible generative families while retaining the computational advantages of couplings.

Likelihood, Dequantization, and Reporting

Exact Likelihood and Gradients

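By the change-of-variables formula, gradients decompose into a base-density term and a log-det term:

$$\nabla_\theta \log p_\theta(x) = \nabla_\theta \log p_0\big(f_\theta^{-1}(x)\big) + \nabla_\theta \log \big|\det \nabla f_\theta^{-1}(x)\big|.$$

Because coupling/autoregressive layers keep the log-determinant analytic, flows train with standard first-order optimizers and are generally well-behaved.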

Dequantization for Discrete Pixels

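Images live on a discrete grid, e.g., $x \in \{0, \dots, 255\}^d$. To fit continuous flows, one dequantizes via $y = x + u$, $u \sim \mathcal{U}[0, 1)^d$. Writing $P(x) = \int_{[0,1)^d} p(x + u)\,du$ for the discrete probability induced by the continuous density $p$, Jensen’s inequality shows

$$\log P(x) = \log \mathbb{E}_u\big[p(x + u)\big] \ \ge\ \mathbb{E}_u\big[\log p(x + u)\big],$$

so maximizing the right-hand side maximizes a valid lower bound on the discrete log-likelihood. Variational dequantization further learns a noise distribution $q(u \mid x)$ to tighten the bound.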
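The bound can be checked numerically in one dimension. In the sketch below a Gaussian stands in for the continuous model (an assumption made purely for illustration), and both sides are approximated with the same midpoint rule over the unit noise interval.

```python
import math

def log_density(y, mean=2.3, std=1.1):
    """Log-density of a toy continuous model (a Gaussian standing in for a flow)."""
    return -0.5 * ((y - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def discrete_log_prob(x, n=2000):
    """log of the average of p(x + u) over u in [0, 1), by the midpoint rule."""
    total = sum(math.exp(log_density(x + (k + 0.5) / n)) for k in range(n))
    return math.log(total / n)

def dequantized_bound(x, n=2000):
    """Average of log p(x + u) over u in [0, 1), by the same midpoint rule."""
    return sum(log_density(x + (k + 0.5) / n) for k in range(n)) / n

# Gap between the discrete log-probability and its lower bound, per integer value.
gaps = [discrete_log_prob(x) - dequantized_bound(x) for x in range(5)]
```

Every gap is nonnegative, and it is larger where the density varies more within a unit cell, which is the slack that a learned noise distribution aims to reduce.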

Bits-Per-Dimension (bpd)

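We report bpd as $-\log_2 p(x) / d$, where $d$ is the number of dimensions (plus a dataset-specific constant for dequantization, e.g., 8 bits per dimension when 8-bit intensities are rescaled to $[0, 1)$ before modeling). This unit normalizes across resolutions and enables fair model comparisons.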
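The conversion from a natural-log likelihood is a one-liner. The helper name `bits_per_dim` and its `offset_bits` argument are hypothetical; which offset applies depends on how the data were rescaled before modeling.

```python
import math

def bits_per_dim(log_prob_nats, num_dims, offset_bits=0.0):
    """Convert the natural-log likelihood of one example to bits per dimension.

    offset_bits is the per-dimension constant from rescaling, for example
    8.0 when intensities in {0, ..., 255} were divided by 256 before being
    fed to the model. The right constant depends on the preprocessing used.
    """
    return -log_prob_nats / (num_dims * math.log(2.0)) + offset_bits

# A 32x32 RGB image has 3072 dimensions.
num_dims = 32 * 32 * 3
one_bit = bits_per_dim(-num_dims * math.log(2.0), num_dims)
uniform_rescaled = bits_per_dim(0.0, num_dims, offset_bits=8.0)
```

As a sanity check, a uniform density on the unit cube has log-density zero, and with the 8-bit offset it scores 8 bpd, the cost of coding each 8-bit value with no model at all.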

Conditional Flows

For a conditional model $p(x \mid c)$, inject the condition $c$ into the coupling subnetworks $s, t$ (or the autoregressive nets) via concatenation, FiLM, or attention. The transform remains invertible in $x$ for each fixed $c$, so exact conditional likelihoods and fast conditional sampling are retained.

Continuous-Time Flows and Transport View

CNF / Neural ODE

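A continuous-time flow evolves a state $z(t)$ by an ODE

$$\frac{dz(t)}{dt} = v_\theta\big(z(t), t\big),$$

and the instantaneous change-of-variables formula states

$$\frac{d}{dt} \log p_t\big(z(t)\big) = -\nabla \cdot v_\theta\big(z(t), t\big) = -\operatorname{tr}\!\Big(\frac{\partial v_\theta}{\partial z}\big(z(t), t\big)\Big).$$

The divergence can be stochastically estimated (Hutchinson trace estimator, $\operatorname{tr}(M) = \mathbb{E}_\epsilon[\epsilon^\top M \epsilon]$ for zero-mean, identity-covariance $\epsilon$), avoiding explicit Jacobians. CNFs offer fine-grained flexibility but require numerical integration; training and evaluation times thus hinge on solver tolerances.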
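A toy sketch of both ingredients, assuming a linear velocity field on the plane so that the divergence is constant and the answer is known. Forward Euler is used for brevity; practical CNFs typically use adaptive solvers and obtain the vector-Jacobian products from automatic differentiation rather than from an explicit matrix.

```python
import itertools
import math

A = [[-0.5, 1.0], [-1.0, 0.2]]   # toy linear velocity field v(z) = A z

def velocity(z):
    return [A[0][0] * z[0] + A[0][1] * z[1],
            A[1][0] * z[0] + A[1][1] * z[1]]

def hutchinson_divergence(probes):
    """Average of eps^T (dv/dz) eps over probe vectors; here dv/dz = A is constant."""
    total = 0.0
    for eps in probes:
        Ae = [A[0][0] * eps[0] + A[0][1] * eps[1],
              A[1][0] * eps[0] + A[1][1] * eps[1]]
        total += eps[0] * Ae[0] + eps[1] * Ae[1]
    return total / len(probes)

def integrate(z0, log_p0, horizon=1.0, steps=1000):
    """Forward Euler on dz/dt = v(z) together with d(log p)/dt = -div v."""
    dt = horizon / steps
    # All four sign vectors: their average equals trace(A) exactly in 2-D.
    # In practice one would draw a few random probes instead.
    probes = list(itertools.product([-1.0, 1.0], repeat=2))
    z, log_p = list(z0), log_p0
    for _ in range(steps):
        div = hutchinson_divergence(probes)
        v = velocity(z)
        z = [z[0] + dt * v[0], z[1] + dt * v[1]]
        log_p -= dt * div
    return z, log_p

z0 = [1.0, 0.5]
log_p0 = -0.5 * (z0[0] ** 2 + z0[1] ** 2) - math.log(2.0 * math.pi)
z1, log_p1 = integrate(z0, log_p0)
```

Since the trace of `A` is negative, the flow contracts volume and the log-density along the trajectory increases by the horizon times the absolute trace.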

Positioning vs. Diffusion/Score Models

Discrete-time flows: exact likelihoods, exact inverses, one-shot sampling; expressivity is governed by layer design. CNFs: flexible dynamics and likelihoods via the instantaneous change-of-variables formula (exact up to solver and trace-estimator error), but with integration cost. Diffusion/score models: superb sample quality and simple training, yet likelihoods are inexact (or expensive) and sampling is multi-step. All can be unified under mass transport and continuity equations, differing in parameterization and numerical pathways.

Practical Design and Stability

Stable Parameterizations

Scale control: Bound the log-scale $s$ (e.g., $\tanh$, clamping) to prevent exploding $\exp(s)$.
Normalization: ActNorm or data-dependent affine initialization improves early stability.
Permutation/mixing: Use invertible $1 \times 1$ convolutions or channel permutations between couplings.
Multi-scale: Squeeze (space-to-channel reshaping) and factor out latents to shorten dependencies and ease optimization.
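The scale-control item can be sketched as follows. Both helpers and the default cap of 2.0 are illustrative assumptions rather than recommended settings.

```python
import math

def bounded_log_scale(raw, cap=2.0):
    """Smoothly squash an unconstrained network output into [-cap, cap]."""
    return cap * math.tanh(raw / cap)

def clamped_log_scale(raw, cap=2.0):
    """Hard alternative: clip to [-cap, cap] (zero gradient outside the range)."""
    return max(-cap, min(cap, raw))

# Even an extreme raw output yields a scale factor of at most exp(cap).
extreme = math.exp(bounded_log_scale(1000.0))
small = bounded_log_scale(0.01)
```

The smooth version is close to the identity for small inputs and keeps a nonzero gradient everywhere, whereas hard clamping stops gradient flow once the bound is hit.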

Diagnostics

Monitor the distribution of the per-layer log-determinants, bpd curves, and intermediate activations. Pathologies (e.g., overly negative log-determinants or saturated scales) often pinpoint subnetworks that need regularization or rescaling.

Remark 3.2.

Maximum likelihood is mode-covering: it heavily penalizes under-estimating density on data regions. This complements adversarial (often mode-seeking) training and partly explains empirical differences in sample diversity.

Limitations and Trade-offs

Dimensionality and Discreteness

A bijection demands equal input and output dimension; truly discrete variables necessitate specialized invertible discrete layers or relaxation via dequantization.

Expressivity vs. Efficiency

Couplings/autoregressive layers restrict per-layer transforms to keep the log-determinant closed-form. Expressivity is then accrued via depth, mixing, and spline nonlinearity, each of which adds cost.

MAF/IAF Speed Asymmetry

Choose MAF when likelihood evaluation dominates (density modeling, anomaly detection); choose IAF when sampling speed is paramount (real-time generation, compression).

Mathematical Underpinnings (Sketches)

Change-of-Variables

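For a diffeomorphism $f : \mathbb{R}^d \to \mathbb{R}^d$, any integrable $h$ satisfies

$$\int h(x)\,dx = \int h\big(f(z)\big)\,\big|\det \nabla f(z)\big|\,dz,$$

yielding the change-of-variables density formula by taking $h(x) = g(x)\,p_0(f^{-1}(x))\,|\det \nabla f^{-1}(x)|$ for a test function $g$ (the right-hand side then reduces to $\mathbb{E}[g(f(z))]$, because the two determinant factors cancel), and then taking logs.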

Instantaneous Formula

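From the continuity equation $\partial_t p_t + \nabla \cdot (p_t v) = 0$, evaluating along characteristics $\dot{z}(t) = v(z(t), t)$ gives $\frac{d}{dt} \log p_t(z(t)) = -\nabla \cdot v(z(t), t)$, and integrating over $[0, T]$ gives $\log p_T(z(T)) = \log p_0(z(0)) - \int_0^T \nabla \cdot v(z(t), t)\,dt$.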
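The step from the continuity equation to the log-density derivative is short enough to spell out. Writing $v$ for the velocity field and $z(t)$ for a trajectory with $\dot{z}(t) = v(z(t), t)$, and assuming $p_t > 0$ along the trajectory, the total derivative expands as

$$\frac{d}{dt} \log p_t\big(z(t)\big) = \frac{\partial_t p_t + \nabla p_t \cdot v}{p_t}, \qquad \partial_t p_t = -\nabla \cdot (p_t v) = -\nabla p_t \cdot v - p_t\,\nabla \cdot v,$$

so the transport terms $\nabla p_t \cdot v$ cancel and only the divergence survives:

$$\frac{d}{dt} \log p_t\big(z(t)\big) = \frac{-p_t\,\nabla \cdot v}{p_t} = -\nabla \cdot v\big(z(t), t\big).$$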

Knothe-Rosenblatt Rearrangement

For sufficiently regular source and target densities $p_0$ and $p$, there exists a monotone triangular map $T$ with $T_\# p_0 = p$. Autoregressive and monotone-spline layers approximate such transport maps numerically, supporting the expressivity of triangular-Jacobian flows.
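In one dimension the monotone map is explicit: it is the target quantile function composed with the source CDF. The sketch below transports a standard Gaussian to a unit-rate exponential, an illustrative pair chosen because both CDFs are available in closed form; the function names are hypothetical.

```python
import math

def std_normal_cdf_complement(z):
    """1 - Phi(z), computed with erfc to stay accurate in the right tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def transport_map(z):
    """Increasing map T pushing N(0, 1) to Exponential(1): quantile of target at Phi(z)."""
    return -math.log(std_normal_cdf_complement(z))

def pushforward_density(x, h=1e-5):
    """Density of T(z), z ~ N(0, 1), as p_0(z) / T'(z) with T' by central differences."""
    # Invert T by bisection, which is valid because T is increasing.
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if transport_map(mid) < x:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    slope = (transport_map(z + h) - transport_map(z - h)) / (2.0 * h)
    base = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return base / slope

median_image = transport_map(0.0)      # the Gaussian median maps to the exponential median
density_at_point = pushforward_density(1.3)
```

In higher dimensions the same construction is applied coordinate by coordinate to conditional distributions, which is what produces the triangular structure.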

Worked Log-Det Examples

Affine Coupling Sum of Scales

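With the affine coupling transform $x_a = z_a$, $x_b = z_b \odot \exp(s(z_a)) + t(z_a)$, the Jacobian is block lower-triangular with diagonal $(1, \dots, 1, \exp(s_1(z_a)), \dots, \exp(s_{d_b}(z_a)))$, hence $\log|\det J| = \sum_j s_j(z_a)$ and inversion is elementwise.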

MAF Sum of Log-Scales

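For the MAF transform $z_i = (x_i - \mu_i(x_{<i}))\exp(-\alpha_i(x_{<i}))$, we have $\partial z_i / \partial x_i = \exp(-\alpha_i(x_{<i}))$ and $\partial z_i / \partial x_j = 0$ for $j > i$. Thus

$$\log\Big|\det \frac{\partial z}{\partial x}\Big| = -\sum_{i=1}^d \alpha_i(x_{<i}).$$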

Invertible Convolution

Treating each spatial location independently, the Jacobian $J$ is block-diagonal with $h \cdot w$ copies of $W$. Therefore $\log|\det J| = h \cdot w \cdot \log|\det W|$, computable efficiently via LU with stable sign handling.

Further Reading (Pointers)