Sampling Multinomial Distribution

  • break the interval [0, 1] into levels proportional to the outcome probabilities $p_1, \dots, p_C$

  • sample a uniform variable $u \sim \mathcal{U}[0, 1]$

  • threshold $u$ against the levels to pick the outcome

  • need to compute the (cumulative) levels first

  • for a data type of dimension $D$ with $C$ possible outcomes per entry, the explicit joint distribution has $C^D$ outcomes, so the complexity is $\mathcal{O}(C^D)$ (see the sketch below)
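
A minimal sketch of the thresholding procedure in plain NumPy; the probability vector `p` is a made-up example:

```python
import numpy as np

def sample_multinomial(p, rng=np.random.default_rng()):
    """Draw one outcome from a categorical/multinomial distribution p."""
    levels = np.cumsum(p)                    # cumulative "levels" partitioning [0, 1]
    u = rng.uniform(0.0, 1.0)                # uniform sample
    return int(np.searchsorted(levels, u))   # first level that exceeds u

p = np.array([0.1, 0.2, 0.3, 0.4])           # example outcome probabilities
print([sample_multinomial(p) for _ in range(5)])
```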

Maximum Likelihood Learning

The likelihood of a model $p_\theta$ computed on a dataset $\{x_1, \dots, x_N\}$ is $\mathcal{L}(\theta) = p_\theta(x_1, \dots, x_N)$

  • if the dataset is i.i.d. (the $x_n$ are independently sampled), then $\mathcal{L}(\theta) = \prod_{n=1}^{N} p_\theta(x_n)$
  • train the model by maximizing the log-likelihood function $\sum_{n=1}^{N} \log p_\theta(x_n)$ (see the sketch below)
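
A minimal sketch of maximum-likelihood training for a categorical model, written here with PyTorch as an assumed framework; the data and model are toy placeholders:

```python
import torch

# toy i.i.d. dataset of discrete outcomes in {0, ..., C-1}
C = 4
data = torch.randint(0, C, (1000,))

# model: a categorical distribution parameterized by free logits
logits = torch.zeros(C, requires_grad=True)
opt = torch.optim.SGD([logits], lr=0.1)

for step in range(200):
    log_probs = torch.log_softmax(logits, dim=0)
    nll = -log_probs[data].mean()    # negative log-likelihood (the MLE risk)
    opt.zero_grad()
    nll.backward()
    opt.step()

print(torch.softmax(logits, dim=0))  # approaches the empirical frequencies
```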

Justification

  • intuitive:

    • in a huge data space, the collected samples are the most likely ones
    • the probability of these samples occurring together is high, so a good model should also make it high
  • Kullback-Leibler divergence, $D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \tfrac{p(x)}{q(x)}\right]$

    • divergence $= 0$ -> distributions match

    • divergence $> 0$ -> distributions do not match

    • divergence is not symmetric, $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$

    • can be estimated using the law of large numbers

  • the Law of Large Numbers states that the sample mean converges to the true mean as the number of trials approaches infinity

  • mathematical:

    • maximizing the likelihood is the same as minimizing the KL divergence between the model and the data distribution (see the derivation below)

Data entropy is the inherent randomness of the data, which limits the error to nonzero values (the loss cannot be driven to zero)
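
A short derivation of the equivalence (a standard argument, written here with $p_{\text{data}}$ for the data distribution and $p_\theta$ for the model):

$$
D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)
= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right]
= -H(p_{\text{data}}) - \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]
\approx -H(p_{\text{data}}) - \frac{1}{N}\sum_{n=1}^{N} \log p_\theta(x_n)
$$

The entropy term does not depend on $\theta$, so minimizing the KL divergence over $\theta$ is the same as maximizing the average log-likelihood; the data entropy is the irreducible floor of the loss.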

Autoregression

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning.png

Model

  • chain rule: $p(x) = \prod_{i=1}^{d} p(x_i \mid x_1, \dots, x_{i-1})$

  • a $d$-dimensional object can be decomposed into a product of $d$ one-dimensional conditionals

  • for a trained model, generating a sample requires evaluating the $d$ conditionals one after another

  • sampling complexity is therefore linear in $d$

  • still very slow in practice, since generation must be done sequentially

Computational

  • generally only a single model with shared parameters is used to represent all conditionals $p(x_i \mid x_{<i})$
  • similar to LMs: the sequence so far is encoded into a context, and the context parameterizes the next-entry distribution

Generation is done sequentially (autoregressive); see the sampling sketch below
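
A minimal sketch of sequential (autoregressive) sampling; `next_logits` is a hypothetical stand-in for whatever model (RNN, masked CNN, transformer) maps a prefix to the next-entry distribution:

```python
import torch

def next_logits(prefix, C=256):
    """Placeholder model: returns logits over C outcomes given the prefix."""
    return torch.zeros(C)  # a real model would condition on `prefix`

def ar_sample(d, C=256):
    x = []
    for i in range(d):                       # d sequential steps: O(d) sampling
        logits = next_logits(x, C)           # context from entries x_1 .. x_{i-1}
        probs = torch.softmax(logits, dim=0)
        x.append(int(torch.multinomial(probs, 1)))
    return x

print(ar_sample(d=8))
```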

Training

Note

the sample index (e.g., $n$ in $x^{(n)}$) refers to data samples
the entry index (e.g., $i$ in $x_i$) refers to entries within a sample

  • teacher forcing: during training there is no need to feed the model's output back in; the conditionals are evaluated on the actual (ground-truth) prefixes, which enables parallel processing (see the sketch after this list)

    • leads to exposure bias: a mismatch between the input distributions seen during training vs. generation

    • mitigate with scheduled sampling: gradually replace true inputs with generated ones during training

    • or with reinforcement learning: introduce a reward system on generated sequences
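
A minimal teacher-forcing sketch with an LSTM next-entry predictor (PyTorch assumed; sizes are illustrative): all conditionals are computed in one parallel pass over the ground-truth sequence, and the loss compares each prediction to the true next entry.

```python
import torch
import torch.nn as nn

C, d, B = 16, 10, 32                 # vocabulary size, sequence length, batch size
x = torch.randint(0, C, (B, d))      # toy ground-truth sequences

emb = nn.Embedding(C, 32)
rnn = nn.LSTM(32, 64, batch_first=True)
head = nn.Linear(64, C)

# teacher forcing: feed the true prefix x_1..x_{d-1}, predict x_2..x_d in parallel
h, _ = rnn(emb(x[:, :-1]))           # (B, d-1, 64), all steps at once
logits = head(h)                     # (B, d-1, C)
loss = nn.functional.cross_entropy(logits.reshape(-1, C), x[:, 1:].reshape(-1))
loss.backward()
```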

Examples of AR Models

RNN-based AR Models

  • the first entry $x_1$ is sampled from a learned prior $p(x_1)$

  • compute the context $h_i$ for a given prefix $x_{<i}$ with the RNN, and then predict $p(x_i \mid x_{<i})$ from $h_i$

  • iterate the model over the $d$ entries

  • diagonal processing uses only the pixels above and to the left of the new pixel

  • this reduces the number of sequential iterations to $2d - 1$ (one per diagonal)

  • the resulting model is different, because the conditioning no longer follows the exact chain rule

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 5.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 6.png

Convolutional AR Models

  • a convolution does not normally extract a sequential context on its own
  • (by contrast, transformer-based AR models use attention to extract the context)
  • use masked filters to eliminate future entries from the receptive field
    • assumes a fixed ordering of the entries (e.g., raster scan)
  • use a deep convolutional NN to get a larger context area (see the masked-convolution sketch after the figures)
    • each layer uses a small mask, but the effective receptive field expands with depth

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 2.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 3.png
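
A minimal sketch of a masked 2-D convolution in the spirit of PixelCNN (PyTorch assumed): the mask zeroes out filter weights at and after the current pixel in raster order, so each output depends only on entries above and to the left.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        k = self.kernel_size[0]
        mask = torch.ones_like(self.weight)
        mask[:, :, k // 2, k // 2 + (mask_type == "B"):] = 0  # center row: at/after current pixel
        mask[:, :, k // 2 + 1:, :] = 0                        # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask   # enforce the causal mask before convolving
        return super().forward(x)

layer = MaskedConv2d(1, 8, kernel_size=5, padding=2, mask_type="A")
out = layer(torch.randn(1, 1, 28, 28))  # (1, 8, 28, 28)
```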

Transformer-based AR Models

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 4.png

Energy Based Models

Model

  • need a model that can compute the distribution $p_\theta(x)$ at any sample $x$

  • it must satisfy the properties of a distribution: non-negative and normalized

  • Boltzmann distribution: $p_\theta(x) = \dfrac{\exp(-E_\theta(x))}{Z_\theta}$

    • the denominator is the partition function $Z_\theta = \sum_x \exp(-E_\theta(x))$, which requires $\mathcal{O}(C^D)$ operations for discrete data, and is needed for both sampling and training EBMs
  • energy-based models use the Boltzmann form to convert an arbitrary energy function $E_\theta$ into a distribution

Computational

  • Boltzmann Machines: simple quadratic energy functions (single linear layer)
  • Neural: deep NN as energy function (non-linear)

Training

  • empirical expectation: $\mathbb{E}_{x \sim p_{\text{data}}}[f(x)] \approx \frac{1}{N}\sum_{n=1}^{N} f(x_n)$

  • can train EBM with MLE:

    00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 8.png

  • the challenge is sampling from the model distribution $p_\theta$, which the gradient requires (see below)
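
A sketch of why sampling is the bottleneck: for the Boltzmann form $p_\theta(x) = \exp(-E_\theta(x))/Z_\theta$, the gradient of the MLE risk splits into a data term and a model term (standard EBM result, stated here for reference):

$$
\nabla_\theta \left( -\frac{1}{N} \sum_{n=1}^{N} \log p_\theta(x_n) \right)
= \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta E_\theta(x_n)
\;-\; \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta E_\theta(x) \right]
$$

The first term is an empirical expectation over the dataset; the second requires samples from the model itself, which is what the samplers below provide.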

Sampling

  • Markov Chain Monte Carlo (MCMC): each sample depends only on the previous one; the chain converges to the target distribution over time

    • zeroth order: use the target distribution (energy values) directly
    • first order: use the gradient $\nabla_x E_\theta(x)$ as well (e.g., Langevin dynamics)
    • second order: use the Hessian as well
  • Gibbs Sampling

    00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 9.png

    • the burn-in period is the time required for the chain's distribution to match the target (a minimal Gibbs sketch follows)
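
A minimal Gibbs sampler for a small Boltzmann machine, assuming binary units and the quadratic energy $E(s) = -\tfrac{1}{2} s^\top W s - b^\top s$, for which the conditionals are sigmoids (NumPy; weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)  # symmetric, no self-coupling
b = rng.normal(size=n)

def gibbs(steps=1000, burn_in=200):
    s = rng.integers(0, 2, size=n).astype(float)
    samples = []
    for t in range(steps):
        for i in range(n):                                   # resample each unit given the rest
            p_i = 1 / (1 + np.exp(-(W[i] @ s + b[i])))       # p(s_i = 1 | s_{-i})
            s[i] = float(rng.random() < p_i)
        if t >= burn_in:                                     # discard the burn-in period
            samples.append(s.copy())
    return np.array(samples)

print(gibbs().mean(axis=0))   # approximate marginals E[s_i]
```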

Boltzmann Machine

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 10.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 11.png
ECE1508 L3 - Generation by Explicit Distribution Learning 12.png

Langevin Dynamics

  • samples from $p_\theta$ using only the energy function (through its gradient $\nabla_x E_\theta$); see the sketch below

  • conservative sampling starts from a data sample and runs a reduced number of steps; this only works for training, not for generating fresh samples
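
A minimal Langevin dynamics sketch (PyTorch autograd assumed; the quadratic energy is a toy placeholder whose true distribution is a standard Gaussian):

```python
import torch

def energy(x):
    return 0.5 * (x ** 2).sum(dim=-1)        # toy energy: E(x) = ||x||^2 / 2

def langevin_sample(n_steps=500, eta=0.01, dim=2):
    x = torch.randn(dim, requires_grad=True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(x), x)[0]
        with torch.no_grad():
            # x_{t+1} = x_t - (eta/2) * grad E(x_t) + sqrt(eta) * noise
            x += -0.5 * eta * grad + (eta ** 0.5) * torch.randn_like(x)
    return x.detach()

print(langevin_sample())
```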

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 13.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 15.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 16.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 17.png

Flow Based Models

Data Manifold: valid data is concentrated on a narrow, lower-dimensional subset (manifold) of the data space
Latent Space: a coordinate system for a transformed version of the data manifold
Latent Representation: the mapping between the data space and the latent space

Normalizing Flow

  • for a random variable $x = f(z)$, the density of $x$ follows from the change-of-variables formula (see below), where $f$ is
    • invertible
    • strictly increasing or decreasing
  • in the vectorized form, the scalar derivative is replaced by a Jacobian
    • the derivative must be a Jacobian matrix
    • $f$ should be differentiable and invertible
    • to be invertible, $z$ and $x$ must be of the same dimension
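
The change-of-variables formula in its scalar and vectorized forms, with $z = f^{-1}(x)$ (stated here for reference):

$$
p_X(x) = p_Z\!\left(f^{-1}(x)\right) \left|\frac{d}{dx} f^{-1}(x)\right|
\qquad\Longrightarrow\qquad
p_X(x) = p_Z\!\left(f^{-1}(x)\right) \left|\det J_{f^{-1}}(x)\right|
$$

where $J_{f^{-1}}(x) = \partial f^{-1}(x) / \partial x$ is the Jacobian of the inverse mapping; the determinant only exists when this matrix is square, hence the equal-dimension requirement.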

Flow Learning

  • the flow is typically built from a deep model, i.e., a composition of layers
    • each mapping in the composition must be invertible and differentiable

Sampling

  • the latent distribution is easy to sample, e.g., a Gaussian; generate by sampling $z$ and computing $x = f(z)$
    00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 18.png

Training

  • maximize the log-likelihood of the normalizing flow, $\log p_X(x) = \log p_Z(f^{-1}(x)) + \log\left|\det J_{f^{-1}}(x)\right|$
  • the negative log-likelihood averaged over the dataset is the MLE risk

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 21.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 19.png

Computational Flow Based Models

  • Real-valued Non-Volume Preserving (RealNVP)
    • used for visual generation
    • training is computationally easy, since the coupling layers have a triangular Jacobian whose log-determinant is cheap to compute (see the coupling-layer sketch after this list)

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 28.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 22.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 23.png
00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 24.png

  • Non-linear Independent Component Estimation (NICE)
    • simpler flow, easy to train, and not too expressive
  • Generative Flow (Glow)
    • 1x1 convolution to combine entries
    • training needs more computation
    • generation quality is better
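
A minimal affine coupling layer sketch in the RealNVP style (PyTorch assumed; network sizes are placeholders): half of the entries pass through unchanged and parameterize an affine transform of the other half, so the layer is invertible and its Jacobian is triangular.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):                     # data -> latent direction
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)  # scale and shift from the untouched half
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)               # triangular Jacobian: log|det| = sum(s)
        return torch.cat([x1, z2], dim=-1), log_det

    def inverse(self, z):                     # exact inverse, used for sampling
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(z1).chunk(2, dim=-1)
        return torch.cat([z1, (z2 - t) * torch.exp(-s)], dim=-1)

layer = AffineCoupling(dim=4)
x = torch.randn(8, 4)
z, log_det = layer(x)
print(torch.allclose(layer.inverse(z), x, atol=1e-5))  # True: the layer is invertible
```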

Flow models are easy to sample

  • faster than AR models
  • faster and easier than EBMs
  • but have a high overall computational cost

00 - Meta/01 - Attachments/ECE1508 L3 - Generation by Explicit Distribution Learning 25.png