Module 5: Autoencoder Models

Recall that we aim to learn a mapping that transforms noise into data-like samples. The classical approach based on maximum likelihood estimation (MLE) requires this mapping to be invertible and to have a tractable Jacobian, so that the change-of-variables formula can be computed explicitly. This restriction motivates specialized architectures such as normalizing flows, which guarantee invertibility but limit flexibility. Generative adversarial networks (GANs) lift this restriction by allowing arbitrary, non-invertible mappings, but replace likelihood maximization with a challenging minimax game between a generator and a discriminator, which often makes training unstable.

We now turn to an alternative, autoencoder-based approach that relies on approximate invertibility. Instead of enforcing the generating map to be exactly invertible, this framework introduces an auxiliary neural network that approximates its inverse. Writing the generating map as a decoder $D_\theta$ and the auxiliary network as an encoder $E_\phi$:

$$E_\phi \approx D_\theta^{-1}.$$

The two networks are trained jointly to achieve approximate reconstruction,

$$D_\theta(E_\phi(X)) \approx X,$$

allowing $D_\theta$ to remain flexible and expressive while preserving a weak notion of invertibility sufficient for learning meaningful latent representations.

We start by introducing autoencoders, the general idea of using encoder-decoder pairs to learn representations, and then discuss how to use them to build generative models such as variational autoencoders and adversarial autoencoders.

Autoencoders: Non-Generative

flowchart LR
    X["input X"] --> Encoder["encoder E_phi"]
    Encoder --> Z["latent code Z"]
    Z --> Decoder["decoder D_theta"]
    Decoder --> Xhat["reconstruction X_hat"]

Structure of an autoencoder. The encoder (left) compresses input data $X$ into a latent code $Z$, and the decoder (right) reconstructs $\hat{X}$ from it.

Example 5.1.

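In linear encoder-decoder models, we have

$$Z = W_e X, \qquad \hat{X} = W_d Z = W_d W_e X,$$

with $W_e \in \mathbb{R}^{k \times d}$ and $W_d \in \mathbb{R}^{d \times k}$, where $d, k$ are the dimensions of $X$ and $Z$, respectively. The autoencoder minimizes the expected squared reconstruction error:

$$\min_{W_e, W_d} \; \mathbb{E}\,\| X - W_d W_e X \|^2 .$$

Let $M = W_d W_e$, which has rank at most $k$. Then the problem becomes

$$\min_{M:\ \operatorname{rank}(M) \le k} \; \mathbb{E}\,\| X - M X \|^2 .$$

This optimization has a closed-form solution related to PCA/SVD.

Define $\Sigma = \mathbb{E}[X X^\top]$ (assuming $X$ is centered), and let $\Sigma = U \Lambda U^\top$ be the eigendecomposition of the covariance matrix, where

$$U = [u_1, \dots, u_d], \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d), \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0 .$$

By the Eckart-Young-Mirsky theorem, the optimal solution above is given by the orthogonal projector onto the top-$k$ eigenspace:

$$M^\star = U_k U_k^\top, \qquad U_k = [u_1, \dots, u_k] .$$

A valid factorization achieving this is obtained via:

$$W_e^\star = U_k^\top, \qquad W_d^\star = U_k,$$

leading to the reconstruction

$$\hat{X} = U_k U_k^\top X .$$

This is exactly the PCA reconstruction from the top $k$ principal components.

The minimum achievable loss equals the sum of the discarded eigenvalues (residual variance):

$$\min_{W_e, W_d} \; \mathbb{E}\,\| X - W_d W_e X \|^2 = \sum_{i=k+1}^{d} \lambda_i .$$

Equivalently, if the centered data matrix (with one sample per column) has the singular value decomposition $U S V^\top$, the optimal linear autoencoder spans the same subspace as the top-$k$ left singular vectors $U_k$.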
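
The closed-form claim can be checked numerically. The following is a minimal NumPy sketch, not part of the original notes: the synthetic data, the dimensions, and the names `W_enc`, `W_dec`, and `U_k` are illustrative assumptions. It builds the PCA-based encoder and decoder from the empirical covariance and compares the resulting reconstruction loss with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 20000  # input dim, latent dim, sample count (illustrative values)

# Synthetic centered data with a different variance along each coordinate.
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.2])
X = rng.standard_normal((n, d)) * scales
X -= X.mean(axis=0)

# Empirical covariance and its eigendecomposition (eigh returns ascending order).
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U_k = eigvecs[:, :k]   # top-k eigenvectors, shape (d, k)
W_enc = U_k.T          # encoder matrix, shape (k, d)
W_dec = U_k            # decoder matrix, shape (d, k)

# Rows of X are samples, so each reconstruction is (U_k U_k^T x)^T.
X_hat = X @ W_enc.T @ W_dec.T
pca_loss = np.mean(np.sum((X - X_hat) ** 2, axis=1))
residual_variance = eigvals[k:].sum()

# Any other rank-k linear map should do no better than the PCA projector.
M_rand = rng.standard_normal((d, k)) @ rng.standard_normal((k, d))
rand_loss = np.mean(np.sum((X - X @ M_rand.T) ** 2, axis=1))

print(pca_loss, residual_variance, rand_loss)
```

Because the loss is evaluated on the same empirical covariance that was decomposed, `pca_loss` and `residual_variance` should agree up to floating-point error.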

Remark 5.1.

Note that the autoencoder solution is not unique. Given any encoder-decoder pair $(E_\phi, D_\theta)$, we can construct another pair with the same reconstruction mapping and hence the same loss. Specifically, for any invertible matrix $A$ acting on the latent space, define

$$\tilde{E} = A \circ E_\phi, \qquad \tilde{D} = D_\theta \circ A^{-1},$$

where $\circ$ denotes function composition. Then $\tilde{D} \circ \tilde{E} = D_\theta \circ A^{-1} \circ A \circ E_\phi = D_\theta \circ E_\phi$, so the reconstruction and the loss remain unchanged. This shows that the latent representation is only defined up to an arbitrary invertible linear transformation of the latent space.
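
As a quick illustration of this non-uniqueness, the sketch below uses random linear maps as a stand-in encoder and decoder (an assumption made only for the example; the names `W_enc`, `W_dec`, and `A` are hypothetical). It inserts an invertible matrix and its inverse between them and confirms that the latent codes change while the reconstruction does not.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3  # illustrative input and latent dimensions

W_enc = rng.standard_normal((k, d))  # stand-in linear encoder
W_dec = rng.standard_normal((d, k))  # stand-in linear decoder

# An invertible latent transformation: orthogonal factor times nonzero scalings.
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
A = Q @ np.diag([2.0, 0.5, -1.0])

W_enc_new = A @ W_enc                  # new encoder: A composed with the old encoder
W_dec_new = W_dec @ np.linalg.inv(A)   # new decoder: old decoder composed with A^{-1}

x = rng.standard_normal(d)
recon_old = W_dec @ (W_enc @ x)
recon_new = W_dec_new @ (W_enc_new @ x)
codes_differ = not np.allclose(W_enc @ x, W_enc_new @ x)

print(np.allclose(recon_old, recon_new), codes_differ)
```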

An autoencoder (AE) learns to represent data through a pair of neural networks: an encoder $E_\phi$ that maps the input $X$ to a low-dimensional latent code $Z = E_\phi(X)$, and a decoder $D_\theta$ that reconstructs the input from the code, $\hat{X} = D_\theta(Z)$. Typically, the latent dimension satisfies $\dim(Z) < \dim(X)$, enforcing information compression.

The training objective minimizes the reconstruction error:

$$\min_{\phi, \theta} \; \mathbb{E}_{X}\,\| X - D_\theta(E_\phi(X)) \|^2 .$$

Autoencoders are widely used for dimensionality reduction, denoising, and pretraining in deep learning. However, they are not inherently generative: the latent representation is unconstrained and does not necessarily follow a known prior distribution, such as a standard Gaussian. As a result, the learned latent space may have a complex or irregular structure, making it difficult to sample latent values that decode to valid data. In other words, without an explicit prior on $Z$, generation from an autoencoder remains ill-defined.

Generative Autoencoders

Autoencoders can be made generative by introducing a regularization term that constrains the distribution of the latent variable $Z$. The idea is to encourage the encoder outputs $Z = E_\phi(X)$, where $X$ is drawn from the data distribution, to follow a simple noise prior $P_\text{noise}$, such as a standard Gaussian. The training objective becomes:

$$\min_{\phi, \theta} \; \mathbb{E}_{X}\,\| X - D_\theta(E_\phi(X)) \|^2 + \mathbb{D}\big(P_Z^\phi,\, P_\text{noise}\big),$$

where $P_Z^\phi$ denotes the distribution of encoded latents, and $\mathbb{D}$ measures their discrepancy from the prior. This regularization bridges the gap between deterministic autoencoders and true generative models: new data can be generated by sampling $Z \sim P_\text{noise}$ and decoding it.

Different generative autoencoder variants are distinguished by the choice of the divergence $\mathbb{D}$:

Adversarial Autoencoders (AAE) use a GAN-based adversarial loss to match $P_Z^\phi$ to $P_\text{noise}$.
Variational Autoencoders (VAE) use the Kullback-Leibler (KL) divergence, with a stochastic encoder that enables a meaningful KL divergence computation.

Adversarial Autoencoders (AAE)

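The Adversarial Autoencoder (AAE) combines the reconstruction principle of autoencoders with the adversarial training mechanism of GANs to regularize the latent space. It enforces the encoded latent distribution $P_Z^\phi$ to match a simple prior $P_\text{noise}$ by adding a GAN-style divergence term to the reconstruction loss:

$$\min_{\phi, \theta} \; \mathbb{E}_{X}\,\| X - D_\theta(E_\phi(X)) \|^2 + \mathbb{D}_\text{GAN}\big(P_Z^\phi,\, P_\text{noise}\big),$$

where $\mathbb{D}_\text{GAN}$ measures the discrepancy between the encoded distribution $P_Z^\phi$ and the prior $P_\text{noise}$. It can be implemented using any off-the-shelf GAN method, which uses a discriminator (or critic) $f_\psi$ that attempts to distinguish encoded samples $E_\phi(X)$ from true noise samples $Z \sim P_\text{noise}$:

$$\mathbb{D}_\text{GAN}\big(P_Z^\phi,\, P_\text{noise}\big) = \max_{\psi} \; \mathcal{L}\big(f_\psi;\, P_Z^\phi,\, P_\text{noise}\big),$$

where the loss $\mathcal{L}$ depends on the specific GAN formulation (e.g., Wasserstein, Jensen-Shannon, or f-GAN).

Full Objective.

The overall optimization becomes a min-max game:

$$\min_{\phi, \theta} \, \max_{\psi} \; \mathbb{E}_{X}\,\| X - D_\theta(E_\phi(X)) \|^2 + \mathcal{L}\big(f_\psi;\, P_Z^\phi,\, P_\text{noise}\big).$$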

This formulation integrates the reconstruction power of autoencoders with the distributional matching capability of GANs, yielding a model that can both encode data and generate new samples.
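
To make the role of the discriminator concrete, here is a small NumPy sketch under strong simplifying assumptions that are not in the original notes: latents are one-dimensional, the encoded distribution is a stand-in Gaussian with mean `encoded_mean`, the prior is a standard Gaussian, and the standard (Jensen-Shannon type) GAN loss is evaluated at the analytically optimal discriminator rather than a trained network. The function name `gan_value_with_optimal_discriminator` is hypothetical.

```python
import numpy as np

def gaussian_pdf(z, mean):
    """Density of N(mean, 1) evaluated at z."""
    return np.exp(-0.5 * (z - mean) ** 2) / np.sqrt(2.0 * np.pi)

def gan_value_with_optimal_discriminator(encoded_mean, n=200_000, seed=0):
    """Monte Carlo value of E_noise[log D(z)] + E_encoded[log(1 - D(z))]."""
    rng = np.random.default_rng(seed)
    z_noise = rng.standard_normal(n)               # samples from the prior N(0, 1)
    z_enc = encoded_mean + rng.standard_normal(n)  # stand-in encoded latents N(m, 1)

    def disc(z):
        # Optimal discriminator for the standard GAN loss: p_noise / (p_noise + p_enc).
        p_noise = gaussian_pdf(z, 0.0)
        p_enc = gaussian_pdf(z, encoded_mean)
        return p_noise / (p_noise + p_enc)

    return np.mean(np.log(disc(z_noise))) + np.mean(np.log(1.0 - disc(z_enc)))

matched = gan_value_with_optimal_discriminator(0.0)     # encoded latents match the prior
mismatched = gan_value_with_optimal_discriminator(2.0)  # encoded latents are shifted

print(matched, mismatched)
```

When the two distributions coincide, the optimal discriminator outputs 1/2 everywhere and the value is $-\log 4$; any mismatch raises it, which is the signal the encoder is trained to remove.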

Figure 5.1. Architecture of an Adversarial Autoencoder (AAE), combining an autoencoder with a GAN-style discriminator to align the latent distribution $P_Z^\phi$ with a prior $P_\text{noise}$. Source: Makhzani et al., Adversarial Autoencoders (2015).

Variational Autoencoders (VAE)

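The Variational Autoencoder (VAE) [Kingma & Welling, 2014] introduces a probabilistic formulation of the autoencoder by combining reconstruction with a KL divergence penalty:

$$\min_{\phi, \theta} \; \mathbb{E}_{X}\,\| X - D_\theta(Z) \|^2 + \mathrm{KL}\big(P_Z^\phi \,\|\, P_\text{noise}\big).$$

Here, $P_Z^\phi$ is the distribution of latent variables induced by the encoder $E_\phi$, and $P_\text{noise}$ is a simple prior, typically $\mathcal{N}(0, I)$.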

Motivation.

The deterministic encoder makes $P_Z^\phi$ an implicit distribution, for which the KL divergence cannot be computed analytically. The VAE resolves this issue by introducing a stochastic encoder that yields an explicit Gaussian form.

Stochastic Encoder.

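Each input is encoded as a Gaussian distribution rather than a point:

$$Z = \mu_\phi(X) + \sigma_\phi(X) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),$$

where $\odot$ denotes elementwise product. Thus, the conditional distribution of the latent variable becomes

$$Z \mid X = x \;\sim\; \mathcal{N}\big(\mu_\phi(x),\, \operatorname{diag}(\sigma_\phi^2(x))\big).$$

To align with the prior $\mathcal{N}(0, I)$, we want to encourage $\mu_\phi(x) \approx 0$ and $\sigma_\phi(x) \approx 1$. Although it is possible to use vanilla squared losses $\|\mu_\phi(x)\|^2$ and $\|\sigma_\phi(x) - 1\|^2$, it is more natural to use the KL divergence as follows.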

Remark 5.2.

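The KL divergence between a one-dimensional Gaussian and the standard Gaussian admits a closed-form expression:

$$\mathrm{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big) = \frac{1}{2}\big(\mu^2 + \sigma^2 - \log \sigma^2 - 1\big).$$

Minimizing this yields $\mu = 0$ and $\sigma = 1$.

Applying the formula above coordinate by coordinate, the overall KL regularization between the approximate posterior and the prior is

$$\mathbb{E}_{X}\Big[\mathrm{KL}\big(\mathcal{N}(\mu_\phi(X), \operatorname{diag}(\sigma_\phi^2(X))) \,\|\, \mathcal{N}(0, I)\big)\Big] = \frac{1}{2}\, \mathbb{E}_{X} \sum_{j} \Big(\mu_{\phi,j}(X)^2 + \sigma_{\phi,j}(X)^2 - \log \sigma_{\phi,j}(X)^2 - 1\Big).$$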
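
The stochastic encoder and the closed-form KL term can be sketched in a few lines of NumPy. This is an illustrative check, not the notes' implementation: the values of `mu` and `sigma` stand in for the encoder outputs at a single input, and the function names are hypothetical. The code samples latents as mean plus standard deviation times noise, then compares the closed-form diagonal-Gaussian KL with a Monte Carlo estimate.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

def sample_latent(mu, sigma, n_samples, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one row per sample."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + sigma * eps

def kl_monte_carlo(mu, sigma, z):
    """Estimate KL as the average of log q(z) - log p(z) over samples z ~ q."""
    log_q = np.sum(-0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma)
                   - 0.5 * np.log(2.0 * np.pi), axis=1)
    log_p = np.sum(-0.5 * z**2 - 0.5 * np.log(2.0 * np.pi), axis=1)
    return np.mean(log_q - log_p)

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])    # stand-in encoder mean for one input
sigma = np.array([0.5, 1.5])  # stand-in encoder standard deviation

z = sample_latent(mu, sigma, 200_000, rng)
kl_exact = kl_to_standard_normal(mu, sigma)
kl_estimate = kl_monte_carlo(mu, sigma, z)

print(kl_exact, kl_estimate)
```

The closed form is what makes the KL penalty cheap to use in training: it needs only the encoder outputs, with no sampling.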

Reconstruction Term.

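Given this stochastic encoder, the reconstruction loss becomes

$$\mathbb{E}_{X}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\, \big\| X - D_\theta\big(\mu_\phi(X) + \sigma_\phi(X) \odot \varepsilon\big) \big\|^2,$$

which measures how well the decoder can reconstruct the data from the sampled latent code.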

Overall Objective.

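The VAE jointly minimizes the reconstruction error and the KL divergence:

$$\min_{\phi, \theta} \; \mathbb{E}_{X}\, \mathbb{E}_{\varepsilon}\, \big\| X - D_\theta\big(\mu_\phi(X) + \sigma_\phi(X) \odot \varepsilon\big) \big\|^2 + \beta\, \mathbb{E}_{X}\Big[\mathrm{KL}\big(\mathcal{N}(\mu_\phi(X), \operatorname{diag}(\sigma_\phi^2(X))) \,\|\, \mathcal{N}(0, I)\big)\Big],$$

where $\beta > 0$ balances reconstruction fidelity against latent regularization.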
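
A minimal sketch of evaluating this objective on a batch is given below. It is only an illustration under stated assumptions: the encoder and decoder are linear maps (`W_mu`, `W_log_sigma`, `W_dec` are hypothetical names), the encoder outputs a log standard deviation, the data are random, and one noise draw per input is used. No training step is shown.

```python
import numpy as np

def vae_objective(x, W_mu, W_log_sigma, W_dec, beta, rng):
    """One-sample Monte Carlo estimate of reconstruction + beta * KL on a batch."""
    mu = x @ W_mu.T                       # encoder mean, shape (n, k)
    log_sigma = x @ W_log_sigma.T         # encoder log standard deviation, shape (n, k)
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)   # noise for the stochastic encoder
    z = mu + sigma * eps                  # sampled latent codes
    x_hat = z @ W_dec.T                   # linear decoder output, shape (n, d)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    kl = np.mean(0.5 * np.sum(mu**2 + sigma**2 - 2.0 * log_sigma - 1.0, axis=1))
    return recon + beta * kl, recon, kl

d, k, n = 4, 2, 256  # illustrative sizes
data_rng = np.random.default_rng(0)
x = data_rng.standard_normal((n, d))
W_mu = 0.1 * data_rng.standard_normal((k, d))
W_log_sigma = 0.1 * data_rng.standard_normal((k, d))
W_dec = 0.1 * data_rng.standard_normal((d, k))

# Same noise seed for both calls, so only the weight beta differs.
total_low, recon_low, kl_low = vae_objective(
    x, W_mu, W_log_sigma, W_dec, 0.1, np.random.default_rng(1))
total_high, recon_high, kl_high = vae_objective(
    x, W_mu, W_log_sigma, W_dec, 4.0, np.random.default_rng(1))

print(total_low, total_high, kl_low)
```

Parameterizing the log standard deviation is a common design choice because it keeps the standard deviation positive without constraints.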

The Trade-off.

The coefficient $\beta$ controls the strength of the KL term: higher $\beta$ promotes more disentangled latent representations but can degrade reconstruction quality. This idea leads to the $\beta$-VAE framework [Higgins et al., 2017], interpreting the VAE as learning a constrained variational representation of data.

See example code here.

VAE: Probabilistic View

We develop a probabilistic perspective of the Variational Autoencoder (VAE), viewing it as a latent variable model defined by joint densities over observed and hidden variables. The key challenge, marginalizing over latent variables, is addressed through variational inference, which converts intractable integrations into tractable optimization problems.

Latent Variable Models

Setup.

We observe data samples $x_1, \dots, x_n$ drawn from an unknown distribution. To model such data, we introduce an unobserved (latent) variable $z$ and define a generative process through a joint density over $(x, z)$:

$$p_\theta(x, z) = p(z)\, p_\theta(x \mid z),$$

where $p(z)$ is the prior distribution over the latent space, and $p_\theta(x \mid z)$ is the conditional likelihood of data given $z$.

Marginal Likelihood.

Since the latent variable is unobserved, the model defines a marginal density for $x$ by integrating out $z$:

$$p_\theta(x) = \int p_\theta(x, z)\, dz = \int p(z)\, p_\theta(x \mid z)\, dz .$$

If both pairs $(x_i, z_i)$ were observed, one could directly maximize the joint log-likelihood:

$$\max_\theta \; \sum_{i=1}^{n} \log p_\theta(x_i, z_i) .$$

However, in practice only the $x_i$ are observed, so we maximize the marginal log-likelihood:

$$\max_\theta \; \sum_{i=1}^{n} \log p_\theta(x_i) = \max_\theta \; \sum_{i=1}^{n} \log \int p_\theta(x_i, z)\, dz .$$

In words, we maximize the log-likelihood of what we observe, and integrate (marginalize) over what we do not.

However, for deep generative models, the integral defining $p_\theta(x)$ is typically intractable, since the latent variable $z$ enters through a nonlinear neural network. Variational inference provides an elegant solution: it replaces the difficult integration over $z$ with an optimization problem over a tractable family of distributions. This leads naturally to the Variational Autoencoder, where the encoder network serves as an approximate inference model for the posterior $p_\theta(z \mid x)$.

VAE as a Latent Variable Model

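Consider a latent variable model where the latent variable $Z$ follows a standard Gaussian prior, and the data $X$ is generated from a conditional Gaussian distribution given $Z$:

$$Z \sim \mathcal{N}(0, I), \qquad X \mid Z = z \;\sim\; \mathcal{N}\big(D_\theta(z),\, \sigma^2 I\big),$$

or equivalently,

$$X = D_\theta(Z) + \sigma\, \xi, \qquad \xi \sim \mathcal{N}(0, I) .$$

This stochastic formulation corresponds to a decoder network $D_\theta$ with additive Gaussian noise of variance $\sigma^2$.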

Model Densities.

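The corresponding densities are, up to normalizing constants:

$$p(z) \propto \exp\Big(-\tfrac{1}{2}\|z\|^2\Big), \qquad p_\theta(x \mid z) \propto \exp\Big(-\tfrac{1}{2\sigma^2}\|x - D_\theta(z)\|^2\Big).$$

Hence, the joint distribution factorizes as

$$p_\theta(x, z) = p(z)\, p_\theta(x \mid z) \propto \exp\Big(-\tfrac{1}{2}\|z\|^2 - \tfrac{1}{2\sigma^2}\|x - D_\theta(z)\|^2\Big).$$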

Learning.

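If both $(x_i, z_i)$ were observed, maximum likelihood estimation (MLE) would maximize the joint log-likelihood:

$$\max_\theta \; \sum_{i=1}^{n} \log p_\theta(x_i, z_i) .$$

However, in practice we only observe $x_i$, and must maximize the marginal log-likelihood:

$$\max_\theta \; \sum_{i=1}^{n} \log \int p(z)\, p_\theta(x_i \mid z)\, dz .$$

The integral over $z$ is generally intractable for neural decoders $D_\theta$, motivating the use of variational inference, as employed in the Variational Autoencoder.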
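
For intuition, the marginal likelihood can be computed exactly in one special case and compared with a naive Monte Carlo estimate. The sketch below assumes a one-dimensional model with a linear decoder (slope `a`) and noise variance `noise_var`; these choices and names are assumptions made for the example, chosen because the marginal is then Gaussian with variance `a**2 + noise_var`.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

a, noise_var, x_obs = 2.0, 0.5, 1.3  # illustrative decoder slope, noise variance, data point

# Naive Monte Carlo: average the likelihood p(x | z) over prior samples z ~ N(0, 1).
rng = np.random.default_rng(0)
z = rng.standard_normal(500_000)
p_x_mc = np.mean(normal_pdf(x_obs, a * z, noise_var))

# Exact marginal for this linear-Gaussian special case: x ~ N(0, a^2 + noise_var).
p_x_exact = normal_pdf(x_obs, 0.0, a**2 + noise_var)

print(p_x_mc, p_x_exact)
```

Prior sampling works here because the latent space is one-dimensional; in high dimensions most prior samples explain the observation poorly, so the estimate becomes unreliable, which is part of the motivation for the variational approach.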

Variational Inference: From Integration to Optimization

Quantities like the marginal likelihood above require computing an integral of the form

$$\int f(z)\, dz, \qquad f(z) \ge 0,$$

which is typically intractable in high-dimensional latent-variable models.

Introducing an Auxiliary Distribution.

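Let $q(z)$ be any valid probability density with $q(z) > 0$ wherever $f(z) > 0$. We can rewrite the integral as an expectation under $q$:

$$\int f(z)\, dz = \int q(z)\, \frac{f(z)}{q(z)}\, dz = \mathbb{E}_{z \sim q}\left[\frac{f(z)}{q(z)}\right].$$

This simple identity forms the basis of importance sampling and variational inference.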
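
The identity can be tried out on an integral whose value is known. In this sketch the integrand is an unnormalized Gaussian, whose integral is $\sqrt{2\pi}$, and the auxiliary density is a wider Gaussian; both choices, and the names `f`, `q_pdf`, and `scale`, are assumptions made for illustration.

```python
import numpy as np

def f(z):
    """Unnormalized integrand; its integral over the real line is sqrt(2 * pi)."""
    return np.exp(-0.5 * z**2)

def q_pdf(z, scale):
    """Density of the auxiliary distribution N(0, scale^2)."""
    return np.exp(-0.5 * (z / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
scale = 2.0
z = scale * rng.standard_normal(200_000)  # samples from q

estimate = np.mean(f(z) / q_pdf(z, scale))  # average of the importance weights f / q
exact = np.sqrt(2.0 * np.pi)

print(estimate, exact)
```

Choosing an auxiliary density with heavier tails than the integrand keeps the ratio bounded, which keeps the variance of the estimate small.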

Applying Jensen’s Inequality.

Taking the logarithm of both sides and applying Jensen's inequality gives

$$\log \int f(z)\, dz = \log \mathbb{E}_{z \sim q}\left[\frac{f(z)}{q(z)}\right] \;\ge\; \mathbb{E}_{z \sim q}\left[\log \frac{f(z)}{q(z)}\right],$$

where the inequality holds because the logarithm is a concave function. The lower bound is called the Evidence Lower Bound (ELBO) in variational inference. Optimizing it over $q$ transforms the original integration problem into an optimization problem.

Remark 5.3.

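For a random variable $Y$ and a concave function $\varphi$,

$$\varphi\big(\mathbb{E}[Y]\big) \;\ge\; \mathbb{E}\big[\varphi(Y)\big],$$

and the inequality reverses for convex functions:

$$\varphi\big(\mathbb{E}[Y]\big) \;\le\; \mathbb{E}\big[\varphi(Y)\big].$$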

Illustration of Jensen’s inequality for convex (left) and concave (right) functions.

Gibbs (or Donsker-Varadhan) Variational Principle

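The Jensen lower bound on $\log \int f(z)\, dz$ becomes an equality when $q$ exactly matches the normalized version of $f$, that is,

$$q^\star(z) = \frac{f(z)}{\int f(z')\, dz'} .$$

To see this, note that for this optimal choice the ratio $f(z)/q^\star(z)$ is constant in $z$, so

$$\mathbb{E}_{z \sim q^\star}\left[\log \frac{f(z)}{q^\star(z)}\right] = \mathbb{E}_{z \sim q^\star}\left[\log \int f(z')\, dz'\right] = \log \int f(z)\, dz .$$

Combining the inequality with this equality case, we can express the logarithm of an integral as an optimization problem:

$$\log \int f(z)\, dz = \max_{q}\; \mathbb{E}_{z \sim q}\left[\log \frac{f(z)}{q(z)}\right],$$

where the maximum is taken over the set of all probability distributions $q$.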

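Further reformulation yields

$$\log \int f(z)\, dz = \max_{q}\; \Big\{ \mathbb{E}_{z \sim q}\big[\log f(z)\big] + H(q) \Big\},$$

where $H(q)$ is the entropy of the distribution $q$:

$$H(q) = -\int q(z) \log q(z)\, dz .$$

This identity is known as the Gibbs variational principle, or Donsker-Varadhan representation, and is fundamental to variational inference. It shows that integration can be replaced by an optimization over distributions $q$, combining two opposing forces: the expected log-likelihood term $\mathbb{E}_{q}[\log f(z)]$, which concentrates $q$ where $f$ is large, and the entropy term $H(q)$, which encourages diversity in $q$.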
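
On a finite set the principle can be verified exactly, with sums in place of integrals. The following sketch is an illustration under that simplifying assumption: `f` holds arbitrary positive weights on eight states, and `objective` is a hypothetical helper computing the expected log-weight plus entropy. The normalized weights attain the log of the total, and randomly drawn distributions stay below it.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.uniform(0.1, 5.0, size=8)  # positive weights f(z) on 8 discrete states
log_total = np.log(f.sum())        # target quantity: log of the sum of f

def objective(q):
    """E_q[log f] + H(q) for a strictly positive probability vector q."""
    return np.sum(q * np.log(f)) - np.sum(q * np.log(q))

q_star = f / f.sum()               # normalized f, the claimed maximizer
best = objective(q_star)

# Random probability vectors (Dirichlet draws) should never exceed the bound.
random_values = [objective(rng.dirichlet(np.ones(8))) for _ in range(1000)]

print(best, log_total, max(random_values))
```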

Variational Inference for VAE

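From the Gibbs variational principle, we have:

$$\log \int f(z)\, dz = \max_{q}\; \Big\{ \mathbb{E}_{z \sim q}\big[\log f(z)\big] + H(q) \Big\}.$$

Applying this identity to the marginal likelihood of a latent variable model with $f(z) = p_\theta(x, z)$ (viewing $x$ as fixed), we obtain:

$$\log p_\theta(x) = \max_{q}\; \Big\{ \mathbb{E}_{z \sim q}\big[\log p_\theta(x, z)\big] + H(q) \Big\}.$$

The distribution $q_\phi(z \mid x)$ serves as an approximate posterior or a probabilistic encoder, typically parameterized by a neural network with parameters $\phi$.

The expression inside the maximization is known as the Evidence Lower Bound (ELBO):

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\big[\log p_\theta(x, z)\big] + H\big(q_\phi(\cdot \mid x)\big).$$

Maximizing the ELBO jointly over $(\theta, \phi)$ provides a tractable surrogate for maximizing the intractable marginal likelihood $\log p_\theta(x)$.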

Expanded Form.

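Since $p_\theta(x, z) = p(z)\, p_\theta(x \mid z)$, the ELBO can be rewritten as:

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),$$

where the first term encourages accurate reconstruction (expected log-likelihood) and the second term regularizes the latent posterior to stay close to the prior.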
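
It can help to see exactly how far the ELBO sits below the log marginal likelihood. The short derivation below is a standard identity added for illustration; it writes the ELBO as the expected log-ratio of joint density to encoder density, and uses $p_\theta(z \mid x) = p_\theta(x, z) / p_\theta(x)$ for the true posterior.

```latex
\begin{align*}
\log p_\theta(x) - \mathrm{ELBO}(\theta, \phi; x)
  &= \log p_\theta(x)
     - \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}
       \big[ \log p_\theta(x, z) - \log q_\phi(z \mid x) \big] \\
  &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}
       \left[ \log q_\phi(z \mid x)
              - \log \frac{p_\theta(x, z)}{p_\theta(x)} \right] \\
  &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}
       \big[ \log q_\phi(z \mid x) - \log p_\theta(z \mid x) \big] \\
  &= \mathrm{KL}\big( q_\phi(z \mid x) \,\|\, p_\theta(z \mid x) \big) \;\ge\; 0 .
\end{align*}
```

So the gap is the KL divergence from the encoder distribution to the true posterior, and it closes exactly when the encoder recovers that posterior.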

Joint Optimization.

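In practice, both $\theta$ and $\phi$ are learned together via:

$$\max_{\theta, \phi} \; \sum_{i=1}^{n} \mathrm{ELBO}(\theta, \phi; x_i) .$$

This variational reformulation converts the intractable integral in $p_\theta(x)$ into an optimization problem over the encoder distribution $q_\phi(z \mid x)$, which is the core idea behind the Variational Autoencoder.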
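
As a closing check, the ELBO can be evaluated in closed form for a one-dimensional linear-Gaussian model, where the exact posterior is also known. The sketch below rests on assumptions made only for the example: a standard Gaussian prior, a linear decoder with slope `a`, noise variance `noise_var`, and a Gaussian encoder distribution with mean `m` and variance `v`. It confirms that the bound is tight at the exact posterior and loose elsewhere.

```python
import numpy as np

a, noise_var, x_obs = 2.0, 0.5, 1.3  # illustrative decoder slope, noise variance, data point

def elbo(m, v):
    """ELBO for q = N(m, v), prior N(0, 1), likelihood N(a * z, noise_var)."""
    # Expected log-likelihood E_q[log p(x | z)], using E_q[(x - a z)^2] = (x - a m)^2 + a^2 v.
    expected_loglik = (-0.5 * np.log(2.0 * np.pi * noise_var)
                       - ((x_obs - a * m) ** 2 + a**2 * v) / (2.0 * noise_var))
    # Closed-form KL( N(m, v) || N(0, 1) ).
    kl = 0.5 * (m**2 + v - np.log(v) - 1.0)
    return expected_loglik - kl

# Exact log marginal likelihood: x ~ N(0, a^2 + noise_var).
marg_var = a**2 + noise_var
log_px = -0.5 * np.log(2.0 * np.pi * marg_var) - 0.5 * x_obs**2 / marg_var

# Exact posterior for this model: precision 1 + a^2 / noise_var.
v_post = 1.0 / (1.0 + a**2 / noise_var)
m_post = v_post * a * x_obs / noise_var

elbo_at_posterior = elbo(m_post, v_post)  # bound is tight here
elbo_at_prior = elbo(0.0, 1.0)            # bound is loose when q equals the prior

print(log_px, elbo_at_posterior, elbo_at_prior)
```

With a neural decoder neither the posterior nor the expected log-likelihood is available in closed form, which is why the encoder parameters are instead optimized with sampled latents.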