Language Models

  • Languages are discrete-valued sequences:

where is the vocabulary of all possible words (or tokens).

  • Discrete valued
  • Variable length
  • Sequential in nature
  • How to build generative models for such data?

Auto-regressive Language Models

  • It is natural to expand the probability using chain rule:
  • This allows for
    • Maximum likelihood estimation during training
      • Maximizing the log likelihood of the training data
  • Next token prediction during inference image

  • Universal question answering:

Neural Parameterization of Auto-regressive Models

  • Chain rule:
  • All are modeled with a neural network with a softmax head:

where is a neural network mapping to .

Generative Pre-trained Transformers (GPT)

  • : Taken to be a transformer with causal masks.
  • Key: All logits in a sentence can be calculated in parallel during training.
flowchart LR
    Tokens["tokens w1 ... wn"] --> Embed["token + position embeddings"]
    Embed --> Blocks["causal GPT blocks"]
    Blocks --> Head["linear head"]
    Head --> Logits["logits for all positions"]

Transformer-Based Language Models

Language models apply self-attention to sequences of tokens. Consider the sentence

We index tokens as and map both words and positions into vectors in . Inputs are typically formed by a sum of embeddings:

The resulting sequence is processed by alternating layers of self-attention and position-wise MLPs:

Position embeddings supply order information that self-attention alone does not encode. In autoregressive models, a causal mask guarantees that token only attends to positions , aligning the computation with left-to-right generation.

More Details on Transformer-Based LMs

  • Token-wise MLPs. The MLP layers act independently on each token’s hidden state (shared parameters across positions), providing nonlinear mixing of channel dimensions complementary to the cross-token mixing of attention.

  • Residual connections. Additive shortcuts are used throughout the stack to preserve gradient flow and allow layers to learn residual refinements of the representation.

  • Layer normalization. Normalization stabilizes optimization by reducing covariate shift within layers:

    with trainable scale and bias . It is applied at fixed points of the block (before/after sublayers) to keep activations in a favorable range.

  • Positional encodings. A common choice is sinusoidal position embeddings,

    which provide a deterministic, smooth encoding of order and relative offsets. (Other encodings are possible; here we focus on the sinusoidal case for clarity.)

  • Input composition. Inputs are typically the sum of word and position embeddings rather than their concatenation, keeping the model width fixed while allowing both content and order information to coexist in each token vector.

  • Causal masking. In autoregressive LMs (e.g., GPT-style), a strictly triangular attention mask enforces that token cannot attend to positions , ensuring the factorization needed for left-to-right likelihood and generation.

Overall, the attention mechanism provides content-adaptive aggregation; multi-head attention diversifies this aggregation across subspaces; self-attention enables all-to-all interaction within a layer; and positional information plus masking specialize the same machinery to the sequential constraints of language modeling. Together with residual connections, layer normalization, and token-wise MLPs, these components form the core computational pattern of Transformer-based LMs.