Problem 1

Problem 1.1 (Gaussian Density Operations).

Let be the density of for (with ), defined on . Answer the following:

  1. Let

    Is a valid density? If so, identify the distribution explicitly, including its parameters.

  2. Let

    Is a valid density? If so, describe its distributional form. Is it generally a single Gaussian?

  3. Let , where and are independent. What is the distribution of ? Give its mean and variance.

  4. Let where . Derive the density of and state its support.

Problem 2

Problem 1.2 (KL Divergence).
  1. Consider two discrete distributions over :

    Hand-calculate and .

  2. Now let the sample space be and consider

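    Compute $\mathrm{KL}(P \,\|\, Q)$ and $\mathrm{KL}(Q \,\|\, P)$. Use the convention that $0 \log 0 = 0$, because $\lim_{x \to 0^+} x \log x = 0$, and that $p \log(p/0) = +\infty$ for $p > 0$.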

  3. For a discrete distribution $P$ on $\mathcal{X}$, define $\mathrm{supp}(P) = \{x \in \mathcal{X} : P(x) > 0\}$, which is the set of all elements with positive probability.

    1. Assume , must it be true that ? Briefly justify.
    2. Assume , must it be true that ? If yes, explain; if not, give a counterexample and explain why.
  4. Consider the following divergence:

    Here , and and are the densities of and respectively. Answer the following questions:

    1. Is this divergence a valid notion of discrepancy? Explain your reasoning.
    2. Under what conditions does this divergence reduce to the KL divergence (either or )?
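
The zero-probability conventions from part 2 can be sketched in code. This is a minimal illustration rather than a solution: the function name `kl_divergence` and the two example distributions are hypothetical, chosen only to exercise the zero-probability cases, and natural logarithms (nats) are assumed.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same finite set."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # convention: a term with p = 0 contributes 0
        if qi == 0.0:
            return math.inf   # convention: p * log(p / 0) = +inf when p > 0
        total += pi * math.log(pi / qi)
    return total

# Hypothetical example distributions (not the ones from the problem statement).
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

print(kl_divergence(p, q))  # finite: q is positive wherever p is
print(kl_divergence(q, p))  # infinite: q puts mass where p has none
```

The asymmetry between the two printed values is the behaviour that part 3 asks about in terms of supports.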

Problem 3

Problem 1.3 (Categorical MLE with Softmax).

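Let $x_1, \dots, x_n$ be i.i.d. observations taking values in $\{1, \dots, K\}$. We parameterize the (unconditional) categorical probabilities via a softmax:

$$p_\theta(x = k) = \frac{\exp(\theta_k)}{\sum_{j=1}^{K} \exp(\theta_j)}, \qquad k = 1, \dots, K.$$

Here $\theta = (\theta_1, \dots, \theta_K) \in \mathbb{R}^K$ are unconstrained parameters. Note: the softmax is invariant to adding a constant to all coordinates, i.e. $p_{\theta + c\mathbf{1}} = p_\theta$ for any $c \in \mathbb{R}$.

Exercise:

  1. Write down the log-likelihood $\ell(\theta)$ for this model (you may express it using the empirical counts $n_k = \sum_{i=1}^{n} \mathbf{1}\{x_i = k\}$).

  2. Compute the gradient $\nabla_\theta \ell(\theta)$ and set it to zero to derive the maximum likelihood estimator $\hat\theta$. Discuss: Is the parameter $\hat\theta$ unique? Is the induced distribution $p_{\hat\theta}$ unique?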

  3. Directly evaluating exponentials can overflow/underflow. For each expression below, state whether it is numerically stable (in standard 64-bit floating point) and explain briefly:

    1. ,
    2. ,
    3. .

    (Clarify whether it may produce Inf/NaN or loss of significance.)

  4. Describe a numerically stable way to compute the softmax for a general vector , and give a stable formula for the log-likelihood.
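
One common way to keep the softmax stable is to subtract the largest coordinate before exponentiating, which the shift invariance noted above permits. The sketch below assumes NumPy; the names `log_softmax`, `softmax`, `categorical_log_likelihood`, `theta`, and `counts` are illustrative and not taken from the course materials.

```python
import numpy as np

def log_softmax(theta):
    """Log of the softmax probabilities, computed with the max-shift trick."""
    theta = np.asarray(theta, dtype=float)
    shifted = theta - np.max(theta)  # largest entry becomes 0, so exp cannot overflow
    return shifted - np.log(np.sum(np.exp(shifted)))

def softmax(theta):
    """Softmax probabilities obtained by exponentiating the stable log-softmax."""
    return np.exp(log_softmax(theta))

def categorical_log_likelihood(theta, counts):
    """Sum over categories of count_k * log p_k, using the stable log-softmax."""
    return float(np.dot(counts, log_softmax(theta)))

# Hypothetical inputs: exp(1000) overflows in float64, yet the shifted version is fine.
theta = np.array([1000.0, 1001.0, 1002.0])
counts = np.array([1, 2, 3])
print(softmax(theta))
print(categorical_log_likelihood(theta, counts))
```

Working in log space for the likelihood avoids taking the log of a probability that has already underflowed to zero.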

Problem 4

Problem 1.4 (Energy-Based Models with Langevin Sampling).

We want to fit a dataset with an energy-based model on of the form

Here is the (unnormalized) log-density (i.e., negative energy), and is the partition function. To ensure integrability, we use

with a neural network (e.g. MLP) parameterized by , , and a scalar so the Gaussian term is isotropic (). We write ; for simplicity, treat as fixed hyperparameters (e.g. , ) unless you wish to tune them manually.

You will implement a toy MLE pipeline in and test it on the provided dataset in the starter Colab.

https://colab.research.google.com/drive/1aNetPvIM2LH2PinAKQxVs_Utwpyy4uYn?usp=sharing

  1. Sampling (Langevin vs. grid). Exact sampling from the model is not available. Besides a provided brute-force grid sampler (on a bounded 2D window with discretization), implement the Langevin algorithm:

    with stepsize . It is expected that approximately follows when the number of steps is very large and the step size is very small. Note: since does not depend on , . Initialize, e.g., ; run multiple times.

    Task: Implement Langevin dynamics and qualitatively compare to the grid sampler.

  2. MLE training. Define the average log-likelihood

    where is the empirical distribution of the dataset. The negative log-likelihood is therefore

    In the lecture we have shown that

    Task: Implement gradient descent on the negative log-likelihood , equivalently gradient ascent on :

    where the model expectation is approximated with samples from your Langevin sampler at current . Train the model until it fits the toy data well (e.g., samples visually match data).
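
The two tasks above can be sketched end to end on a model simple enough to check by hand. This is an illustration under stated assumptions, not the starter code: it assumes the common unadjusted Langevin update `x <- x + step_size * grad_x f(x) + sqrt(2 * step_size) * noise` (the course's step-size convention may differ), and it replaces the neural network with a hypothetical toy log-density `f_theta(x) = -0.5 * ||x - theta||^2`, whose gradients are known in closed form. All function and variable names are made up for this example.

```python
import numpy as np

def langevin_sample(grad_x_f, x0, step_size, n_steps, rng):
    """Unadjusted Langevin iteration (assumed convention):
    x <- x + step_size * grad_x f(x) + sqrt(2 * step_size) * N(0, I) noise."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * grad_x_f(x) + np.sqrt(2.0 * step_size) * noise
    return x

def fit_toy_ebm(data, n_iters=200, lr=0.1, n_chains=512, seed=0):
    """Gradient ascent on the average log-likelihood of the hypothetical toy model
    f_theta(x) = -0.5 * ||x - theta||^2 (a stand-in for the neural network)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    theta = np.zeros(dim)
    for _ in range(n_iters):
        # For this toy f: grad_x f_theta(x) = -(x - theta).
        grad_x_f = lambda x, th=theta: -(x - th)
        x0 = rng.standard_normal((n_chains, dim))  # fresh chains every iteration
        samples = langevin_sample(grad_x_f, x0, step_size=0.05, n_steps=100, rng=rng)
        # For this toy f: grad_theta f_theta(x) = x - theta.
        data_term = (data - theta).mean(axis=0)      # expectation under the data
        model_term = (samples - theta).mean(axis=0)  # expectation under the model
        theta = theta + lr * (data_term - model_term)
    return theta

# Hypothetical 2D dataset, used only to exercise the sketch.
rng = np.random.default_rng(1)
data = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(1000, 2))
theta_hat = fit_toy_ebm(data)
print(theta_hat, data.mean(axis=0))
```

For this toy model the maximum likelihood solution is the data mean, so the two printed vectors should roughly agree. With a neural network, both gradients would come from automatic differentiation instead of the closed forms used here.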

Optional Problems

Optional Problem 1

Problem 1.5 (Laplace MLE).

Consider a dataset generated from a distribution with density function:

This is known as the Laplace (or double exponential) distribution.

Exercise:

  1. Write down the log-likelihood function for this distribution.
  2. Find the maximum likelihood estimator by maximizing .
  3. Show that can be written as a simple function of .

Optional Problem 2

Problem 1.6 (Exponential MLE).

Consider a dataset of nonnegative numbers generated from an exponential distribution with density:

Exercise:

  1. Write down the log-likelihood function .
  2. Find the maximum likelihood estimator by maximizing .
  3. Show that , where is the sample mean.

Optional Problem 3

Problem 1.7 (Poisson MLE).

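Consider a dataset $x_1, \dots, x_n$ of non-negative integers following a Poisson distribution with probability mass function:

$$P(X = k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots$$

Exercise:

  1. Write down the log-likelihood function $\ell(\lambda)$.
  2. Find the maximum likelihood estimator $\hat\lambda$.
  3. Prove that $\hat\lambda$ equals the sample mean of the observations.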
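
A numerical sanity check can complement the proof in part 3. The sketch below assumes the standard Poisson mass function with rate `lam` and maximizes the log-likelihood over a coarse grid. The data values, the grid, and the function name are hypothetical; the check illustrates the claim but does not prove it.

```python
import math

def poisson_log_likelihood(lam, data):
    """Sum over observations of x * log(lam) - lam - log(x!)."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

data = [2, 0, 3, 1, 4, 2]            # hypothetical observations
sample_mean = sum(data) / len(data)

# Coarse grid search over candidate rates in (0, 10].
grid = [0.01 * k for k in range(1, 1001)]
best = max(grid, key=lambda lam: poisson_log_likelihood(lam, data))

print(best, sample_mean)
```

The grid maximizer should land within one grid step of the sample mean, which is what the analytic derivation predicts.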