Sampling Multinomial Distribution
- break the interval [0,1] into levels proportional to the outcome probabilities $p_1, \dots, p_C$
- sample a uniform variable $u \sim U(0,1)$
- threshold $u$ against the levels: the outcome is the first level that exceeds $u$ (sketch below)
- need to compute the cumulative levels $l_k = \sum_{j \le k} p_j$ first
- for a data type of size D with C possible outcomes per entry, the joint distribution has $C^D$ outcomes, so complexity $= O(C^D)$
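A minimal NumPy sketch of this inverse-CDF procedure; the probability vector `probs` and the outcome count are illustrative, not from the notes:

```python
import numpy as np

def sample_multinomial(probs, rng):
    """Sample one outcome index from a multinomial/categorical distribution
    by thresholding a uniform variable against the cumulative levels."""
    levels = np.cumsum(probs)               # break [0, 1] into C levels
    u = rng.uniform()                       # u ~ U(0, 1)
    return int(np.searchsorted(levels, u))  # first level that exceeds u

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical outcome probabilities
samples = [sample_multinomial(probs, rng) for _ in range(10000)]
print(np.bincount(samples) / len(samples))  # should approximate probs
```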
Maximum Likelihood Learning
Likelihood of model: $L(\theta) = p_\theta(x^{(1)}, \dots, x^{(N)})$
- if i.i.d. dataset (samples $x^{(i)}$ are independently sampled), then $L(\theta) = \prod_{i=1}^{N} p_\theta(x^{(i)})$
- train model by maximizing the log-likelihood function $\ell(\theta) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})$
Justification
- intuitive:
  - in a huge data space, the collected samples are the most likely ones
  - the probability of these samples occurring together should therefore be high under a good model
Kullback-Leibler Divergence
- $D_{KL}(p_{data} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{data}}\!\left[\log \frac{p_{data}(x)}{p_\theta(x)}\right]$
- divergence = 0 -> distributions matched
- divergence > 0 -> distributions unmatched
- divergence is not symmetric: $D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$
- can be estimated using the Law of Large Numbers: $D_{KL}(p_{data} \,\|\, p_\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \log \frac{p_{data}(x^{(i)})}{p_\theta(x^{(i)})}$ for $x^{(i)} \sim p_{data}$
- the Law of Large Numbers states that the sample mean converges to the true mean as the number of trials approaches infinity
- mathematical:
  - maximizing likelihood is the same as minimizing the KL divergence between the data distribution and the model distribution (derivation below)
Data Entropy is the inherent randomness of the data, which keeps the achievable loss above zero (the cross-entropy can never drop below the data entropy)
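A short derivation of this equivalence, expanding the KL definition above (standard argument, stated here for completeness):

$$
D_{KL}(p_{data} \,\|\, p_\theta)
= \mathbb{E}_{x \sim p_{data}}[\log p_{data}(x)] - \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]
= -H(p_{data}) - \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]
$$

Since the data entropy $H(p_{data})$ does not depend on $\theta$, minimizing $D_{KL}$ over $\theta$ is the same as maximizing the expected log-likelihood $\mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]$; the entropy term is the irreducible part of the loss.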
Autoregression
Model
- chain-rule: $p(x) = \prod_{i=1}^{d} p(x_i \mid x_1, \dots, x_{i-1})$
- a d-dimensional object can be decomposed into a product of 1-dimensional conditionals
- for a trained model, drawing a sample only requires d draws from 1-dimensional distributions, each of complexity $O(C)$
- sampling complexity is linear: $O(d \cdot C)$ instead of $O(C^d)$ (sketch below)
- still very slow, since the entries must be sampled sequentially
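A minimal sketch of sequential chain-rule sampling; `conditional_probs` is a hypothetical stand-in for whatever trained model produces $p(x_i \mid x_{<i})$, here faked with a toy rule:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 4, 8  # C possible values per entry, d entries

def conditional_probs(prefix):
    """Stand-in for a trained model returning p(x_i | x_<i) as a length-C vector.
    Toy rule: favour repeating the previous value; uniform for the first entry."""
    p = np.ones(C)
    if prefix:
        p[prefix[-1]] += 3.0
    return p / p.sum()

def sample_autoregressive():
    x = []
    for _ in range(d):                         # d strictly sequential steps
        probs = conditional_probs(x)           # O(C) work per step
        x.append(int(rng.choice(C, p=probs)))  # sample one entry from its conditional
    return x                                   # total cost O(d * C), not O(C**d)

print(sample_autoregressive())
```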
Computational
- generally only use a single shared model for all d conditionals, rather than d separate models
- similar to LMs: the sequence so far is used to compute a context, and the context determines the distribution of the next entry
Generation is done sequentially (autoregressive)
Training
- teacher forcing: no need to feed the model's output back in; condition on the actual data samples instead, which enables parallel processing of all positions (sketch below)
- leads to exposure bias: a mismatch between the input distributions seen during training vs. generation
- resolve with scheduled sampling: gradually replace true samples with generated ones during training
- or with reinforcement learning: introduce a reward system for the generated outputs
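A minimal teacher-forcing sketch in PyTorch, assuming a toy GRU model (the `TinyAR` class, all sizes, and the random data are illustrative, not from the notes): every position is scored in one parallel forward pass against the true next token.

```python
import torch
import torch.nn as nn

C, d, batch = 16, 10, 32                       # vocabulary size, sequence length, batch size

class TinyAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(C, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.out = nn.Linear(64, C)

    def forward(self, tokens):                 # tokens: (batch, length)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)                     # logits for the *next* token at each position

model = TinyAR()
x = torch.randint(0, C, (batch, d))            # fake training sequences

# Teacher forcing: inputs are the true tokens x_1..x_{d-1},
# targets are the shifted tokens x_2..x_d; all positions are trained in parallel.
logits = model(x[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, C), x[:, 1:].reshape(-1))
loss.backward()
print(loss.item())
```

Generation, by contrast, must call the model d times and feed each sampled token back in, which is exactly the exposure-bias mismatch described above.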
Examples of AR Models
RNN-based AR Models
- $x_1$ is sampled from a learned prior $p(x_1)$
- compute the context $h_i$ for a given prefix $x_{<i}$ with the RNN, and then predict $p(x_i \mid h_i)$
- iterate the model over the d entries (sketch below)
- diagonal processing conditions each new pixel only on the pixels above it and to its left
- this reduces the number of sequential iterations to 2d-1 (the number of anti-diagonals of a $d \times d$ image), instead of $d^2$
- the resulting model is different, since this ordering no longer follows the exact chain rule
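A minimal sketch of RNN-based AR sampling with a GRU cell (untrained, with made-up sizes, and a stand-in start token in place of a learned prior): the hidden state carries the context for $x_{<i}$, and each entry is drawn from its predicted conditional.

```python
import torch
import torch.nn as nn

C, d = 16, 10                                   # values per entry, number of entries
emb = nn.Embedding(C, 32)
cell = nn.GRUCell(32, 64)
head = nn.Linear(64, C)

@torch.no_grad()
def sample():
    h = torch.zeros(1, 64)                      # context starts empty
    x_prev = torch.zeros(1, dtype=torch.long)   # stand-in "start" token / learned prior
    xs = []
    for _ in range(d):                          # strictly sequential: one entry per step
        h = cell(emb(x_prev), h)                # update context with the previous entry
        probs = torch.softmax(head(h), dim=-1)  # conditional p(x_i | x_<i)
        x_prev = torch.multinomial(probs, 1).squeeze(1)
        xs.append(int(x_prev))
    return xs

print(sample())
```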
Convolutional AR Models
- plain convolution does not normally extract context in an order-respecting (causal) way
- transformers can also be used to extract context (see Transformer-based AR Models)
- use masked filters to eliminate future entries from each filter's view (sketch below)
- assume the conditional $p(x_i \mid x_{<i})$ depends only on a local neighbourhood of earlier entries
- assume the same conditional model (shared filters) applies at every position
- use a deep convolutional NN to get a larger context area
- each layer uses a small mask, but stacking layers expands the effective receptive field
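A minimal NumPy sketch of a PixelCNN-style masked filter: positions at or after the centre (in raster order) are zeroed so the convolution never sees future entries. The type-A/type-B split follows the usual convention (type A also masks the centre); the code is illustrative, not the notes' exact construction.

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Binary k x k mask for an autoregressive (raster-order) convolution filter.
    Zeroes every position below the centre row, and every position to the right
    of the centre in the centre row; type 'A' also zeroes the centre itself."""
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c + 1:] = 0          # same row, to the right of the current pixel
    mask[c + 1:, :] = 0          # all rows below the current pixel
    if mask_type == "A":
        mask[c, c] = 0           # first layer: do not look at the pixel being predicted
    return mask

print(causal_mask(5, "A"))
# The masked filter is applied as weights * mask before convolving,
# so the prediction for a pixel uses only pixels above it and to its left.
```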
Transformer-based AR Models
Energy Based Models
Model
- need a model that can compute the distribution value at any given sample $x$
- properties of a distribution: $p(x) \ge 0$ everywhere, and $\sum_x p(x) = 1$
- Boltzmann distribution: $p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$
- the denominator is the partition function $Z_\theta = \sum_x e^{-E_\theta(x)}$, which requires $O(C^D)$ operations and is needed for both sampling and training EBMs
- energy-based models use the Boltzmann distribution to convert an (unnormalized) energy function $E_\theta(x)$ into a proper distribution (sketch below)
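A tiny NumPy illustration of this conversion on a discrete space, using a made-up quadratic energy: taking $e^{-E}$ and normalizing yields a valid distribution, but computing the partition function means enumerating all $C^D$ states.

```python
import numpy as np
from itertools import product

C, D = 3, 4                                   # small enough to enumerate all C**D states

def energy(x, W):
    """Made-up quadratic (Boltzmann-machine-style) energy of a discrete state x."""
    x = np.asarray(x, dtype=float)
    return -x @ W @ x

rng = np.random.default_rng(0)
W = rng.normal(size=(D, D))

states = list(product(range(C), repeat=D))    # all C**D = 81 states
energies = np.array([energy(s, W) for s in states])

Z = np.sum(np.exp(-energies))                 # partition function: O(C**D) work
probs = np.exp(-energies) / Z                 # Boltzmann distribution over all states
print(len(states), probs.sum())               # 81, 1.0
```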
Computational
- Boltzmann Machines: simple quadratic energy functions (single linear layer)
- Neural: deep NN as energy function (non-linear)
Training
- empirical expectation: $\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$ for samples $x^{(i)} \sim p$
- can train an EBM with MLE: $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta E_\theta(x')\right]$, so the partition function never has to be computed explicitly
- the challenge is to sample from the model distribution $p_\theta$ in order to estimate the second expectation (sketch below)
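A sketch of the resulting two-term update in PyTorch, assuming samples from the model are already available (e.g. via the Langevin sampler below); the energy network and both batches are placeholders. Autograd on this surrogate loss reproduces the gradient above (for NLL minimization: positive phase on data, negative phase on model samples).

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # E_theta(x)
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

x_data = torch.randn(128, 2)     # placeholder data batch
x_model = torch.randn(128, 2)    # placeholder samples from p_theta (e.g. via MCMC/Langevin)

# Minimizing this surrogate gives the MLE gradient:
#   grad NLL = E_data[grad E(x)] - E_model[grad E(x)]
loss = energy(x_data).mean() - energy(x_model).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```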
Sampling
- Markov Chain Monte Carlo (MCMC): each sample depends only on the previous one, and the chain converges to the target distribution over time
  - zero order: use the target distribution (energy values) directly
  - first order: use the gradient $\nabla_x E_\theta(x)$
  - second order: use the Hessian
- Gibbs Sampling: resample one dimension at a time from its conditional, keeping the others fixed
- burn-in period is the time required for the chain's distribution to match the target
Boltzmann Machine
Langevin Dynamics
- samples from $p_\theta(x)$ using only the energy function: $x_{t+1} = x_t - \frac{\epsilon}{2}\nabla_x E_\theta(x_t) + \sqrt{\epsilon}\, z_t$ with $z_t \sim \mathcal{N}(0, I)$ (sketch below)
- conservative sampling starts from a data sample $x_0 \sim p_{data}$ and runs only a reduced number of steps; this shortcut only works for training (as in contrastive divergence)
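A minimal PyTorch sketch of this Langevin update; the energy network, step size, and number of steps are illustrative placeholders.

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # E_theta(x)

def langevin_sample(n, steps=100, eps=0.01):
    x = torch.randn(n, 2)                                    # start from noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]    # grad_x E_theta(x)
        noise = torch.randn_like(x)
        x = x - 0.5 * eps * grad + (eps ** 0.5) * noise      # Langevin update
    return x.detach()

print(langevin_sample(8))
```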
Flow Based Models
Data Manifold: valid data is concentrated on a narrow, lower-dimensional manifold
Latent Space: a coordinate system for (a transformed version of) the data manifold
Latent Representation: the mapping between data space and latent space
Normalizing Flow
- for a random variable $z \sim p_z$ with $x = f(z)$, the change of variables gives $p_x(x) = p_z(f^{-1}(x)) \left|\frac{d f^{-1}(x)}{dx}\right|$, where f is
  - invertible
  - strictly increasing or decreasing
- vectorized form: $p_x(x) = p_z(f^{-1}(x)) \left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|$; the Jacobian must exist, so f should be differentiable and invertible (numeric check below)
  - to be invertible, z and x must be of the same dimension
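A quick numeric sanity check of the 1-D change-of-variables formula, using a hypothetical affine flow $x = f(z) = 2z + 1$ with a standard Gaussian latent (SciPy is used only for the Gaussian pdf):

```python
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0                       # x = f(z) = a*z + b, invertible and increasing
x = np.linspace(-4, 6, 5)

# change of variables: p_x(x) = p_z(f^{-1}(x)) * |d f^{-1}/dx| = p_z((x - b)/a) / |a|
p_x_formula = norm.pdf((x - b) / a) / abs(a)

# the pushforward of N(0,1) through an affine map is N(b, a^2), so compare directly
p_x_exact = norm.pdf(x, loc=b, scale=abs(a))
print(np.allclose(p_x_formula, p_x_exact))   # True
```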
Flow Learning
- the flow is typically built with deep models (a composition of several mappings)
- each mapping in the composition must be invertible and differentiable
Sampling
- the latent distribution is easy to sample from, e.g. a Gaussian; a data sample is then obtained as $x = f(z)$
Training
- maximize the log-likelihood of the normalizing flow: $\log p_x(x) = \log p_z(f^{-1}(x)) + \log\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|$; the average negative log-likelihood over the data is the MLE risk (sketch below)
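A minimal trainable flow in PyTorch under strong simplifying assumptions: a hypothetical element-wise affine flow $z = f^{-1}(x) = (x - b)\,e^{-s}$, so the log-determinant term is just $-\sum_j s_j$. Real models such as RealNVP, NICE, and Glow use coupling layers instead, but the training loss has exactly this form.

```python
import torch

D = 2
s = torch.zeros(D, requires_grad=True)          # log-scale parameters
b = torch.zeros(D, requires_grad=True)          # shift parameters
opt = torch.optim.Adam([s, b], lr=0.05)

data = torch.randn(1024, D) * 3.0 + 5.0         # toy data the flow should fit

for step in range(1000):
    z = (data - b) * torch.exp(-s)              # z = f^{-1}(x)
    log_pz = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(dim=1)
    log_det = -s.sum()                          # log|det d f^{-1}/dx| = -sum(s)
    nll = -(log_pz + log_det).mean()            # MLE risk: negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()

print(s.exp().detach(), b.detach())             # should approach scale ~3 and shift ~5
```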
Computational Flow Based Models
- Real-valued Non-Volume Preserving (RealNVP)
  - used for visual generation
  - training is computationally easy: the coupling layers give a triangular Jacobian, so the determinant is cheap to compute
- Non-linear Independent Component Estimation (NICE)
  - simpler flow: easy to train, but not very expressive
- Generative Flow (Glow)
  - uses invertible 1x1 convolutions to combine entries (channels)
  - training needs more computation
  - generation quality is better
flow models are easy to sample from
- faster than AR models
- faster and easier than EBMs
- but still have a high computational cost overall




















