Sampling Multinomial Distribution
- break the interval [0,1] into levels proportional to the outcome probabilities $p_1, \dots, p_C$
- sample a uniform variable $u \sim U(0,1)$
- threshold $u$ against the levels: the outcome is the first level that exceeds $u$ (sketch below)
- need to compute the cumulative levels $l_k = \sum_{j \le k} p_j$ first
- for a data type of size D with C possible outcomes per entry, the joint distribution has $C^D$ outcomes, so complexity $= O(C^D)$
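A minimal NumPy sketch of this inverse-CDF procedure; the probability vector `probs` and the outcome count are illustrative, not from the notes:

```python
import numpy as np

def sample_multinomial(probs, rng):
    """Sample one outcome index from a multinomial/categorical distribution
    by thresholding a uniform variable against the cumulative levels."""
    levels = np.cumsum(probs)               # break [0, 1] into C levels
    u = rng.uniform()                       # u ~ U(0, 1)
    return int(np.searchsorted(levels, u))  # first level that exceeds u

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical outcome probabilities
samples = [sample_multinomial(probs, rng) for _ in range(10000)]
print(np.bincount(samples) / len(samples))  # should approximate probs
```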
Maximum Likelihood Learning
Likelihood of model: $L(\theta) = p_\theta(x^{(1)}, \dots, x^{(N)})$
- if i.i.d. dataset (samples $x^{(i)}$ are independently sampled), then $L(\theta) = \prod_{i=1}^{N} p_\theta(x^{(i)})$
- train model by maximizing the log-likelihood function $\ell(\theta) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})$
Justification
- intuitive:
  - in a huge data space, the collected samples are the most likely ones
  - the probability of these samples occurring together should therefore be high under a good model
Kullback-Leibler Divergence
- $D_{KL}(p_{data} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{data}}\!\left[\log \frac{p_{data}(x)}{p_\theta(x)}\right]$
- divergence = 0 -> distributions matched
- divergence > 0 -> distributions unmatched
- divergence is not symmetric: $D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p)$
- can be estimated using the Law of Large Numbers: $D_{KL}(p_{data} \,\|\, p_\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \log \frac{p_{data}(x^{(i)})}{p_\theta(x^{(i)})}$ for $x^{(i)} \sim p_{data}$
- the Law of Large Numbers states that the sample mean converges to the true mean as the number of trials approaches infinity
- mathematical:
  - maximizing likelihood is the same as minimizing the KL divergence between the data distribution and the model distribution (derivation below)
Data Entropy is the inherent randomness of the data, which keeps the achievable loss above zero (the cross-entropy can never drop below the data entropy)
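A short derivation of this equivalence, expanding the KL definition above (standard argument, stated here for completeness):

$$
D_{KL}(p_{data} \,\|\, p_\theta)
= \mathbb{E}_{x \sim p_{data}}[\log p_{data}(x)] - \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]
= -H(p_{data}) - \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]
$$

Since the data entropy $H(p_{data})$ does not depend on $\theta$, minimizing $D_{KL}$ over $\theta$ is the same as maximizing the expected log-likelihood $\mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]$; the entropy term is the irreducible part of the loss.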
Autoregression
Model
- chain-rule: $p(x) = \prod_{i=1}^{d} p(x_i \mid x_1, \dots, x_{i-1})$
- a d-dimensional object can be decomposed into a product of 1-dimensional conditionals
- for a trained model, drawing a sample only requires d draws from 1-dimensional distributions, each of complexity $O(C)$
- sampling complexity is linear: $O(d \cdot C)$ instead of $O(C^d)$ (sketch below)
- still very slow, since the entries must be sampled sequentially
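A minimal sketch of sequential chain-rule sampling; `conditional_probs` is a hypothetical stand-in for whatever trained model produces $p(x_i \mid x_{<i})$, here faked with a toy rule:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 4, 8  # C possible values per entry, d entries

def conditional_probs(prefix):
    """Stand-in for a trained model returning p(x_i | x_<i) as a length-C vector.
    Toy rule: favour repeating the previous value; uniform for the first entry."""
    p = np.ones(C)
    if prefix:
        p[prefix[-1]] += 3.0
    return p / p.sum()

def sample_autoregressive():
    x = []
    for _ in range(d):                         # d strictly sequential steps
        probs = conditional_probs(x)           # O(C) work per step
        x.append(int(rng.choice(C, p=probs)))  # sample one entry from its conditional
    return x                                   # total cost O(d * C), not O(C**d)

print(sample_autoregressive())
```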
Computational
- generally only use a single shared model for all d conditionals, rather than d separate models
- similar to LMs: the sequence so far is used to compute a context, and the context determines the distribution of the next entry
Generation is done sequentially (autoregressive)
Training
- teacher forcing: no need to feed the model's output back in; condition on the actual data samples instead, which enables parallel processing of all positions (sketch below)
- leads to exposure bias: a mismatch between the input distributions seen during training vs. generation
- resolve with scheduled sampling: gradually replace true samples with generated ones during training
- or with reinforcement learning: introduce a reward system for the generated outputs
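A minimal teacher-forcing sketch in PyTorch, assuming a toy GRU model (the `TinyAR` class, all sizes, and the random data are illustrative, not from the notes): every position is scored in one parallel forward pass against the true next token.

```python
import torch
import torch.nn as nn

C, d, batch = 16, 10, 32                       # vocabulary size, sequence length, batch size

class TinyAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(C, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.out = nn.Linear(64, C)

    def forward(self, tokens):                 # tokens: (batch, length)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)                     # logits for the *next* token at each position

model = TinyAR()
x = torch.randint(0, C, (batch, d))            # fake training sequences

# Teacher forcing: inputs are the true tokens x_1..x_{d-1},
# targets are the shifted tokens x_2..x_d; all positions are trained in parallel.
logits = model(x[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, C), x[:, 1:].reshape(-1))
loss.backward()
print(loss.item())
```

Generation, by contrast, must call the model d times and feed each sampled token back in, which is exactly the exposure-bias mismatch described above.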
Examples of AR Models
RNN-based AR Models
- $x_1$ is sampled from a learned prior $p(x_1)$
- compute the context $h_i$ for a given prefix $x_{<i}$ with the RNN, and then predict $p(x_i \mid h_i)$
- iterate the model over the d entries (sketch below)
- diagonal processing conditions each new pixel only on the pixels above it and to its left
- this reduces the number of sequential iterations to 2d-1 (the number of anti-diagonals of a $d \times d$ image), instead of $d^2$
- the resulting model is different, since this ordering no longer follows the exact chain rule
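A minimal sketch of RNN-based AR sampling with a GRU cell (untrained, with made-up sizes, and a stand-in start token in place of a learned prior): the hidden state carries the context for $x_{<i}$, and each entry is drawn from its predicted conditional.

```python
import torch
import torch.nn as nn

C, d = 16, 10                                   # values per entry, number of entries
emb = nn.Embedding(C, 32)
cell = nn.GRUCell(32, 64)
head = nn.Linear(64, C)

@torch.no_grad()
def sample():
    h = torch.zeros(1, 64)                      # context starts empty
    x_prev = torch.zeros(1, dtype=torch.long)   # stand-in "start" token / learned prior
    xs = []
    for _ in range(d):                          # strictly sequential: one entry per step
        h = cell(emb(x_prev), h)                # update context with the previous entry
        probs = torch.softmax(head(h), dim=-1)  # conditional p(x_i | x_<i)
        x_prev = torch.multinomial(probs, 1).squeeze(1)
        xs.append(int(x_prev))
    return xs

print(sample())
```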
Convolutional AR Models
- plain convolution does not normally extract context in an order-respecting (causal) way
- transformers can also be used to extract context (see Transformer-based AR Models)
- use masked filters to eliminate future entries from each filter's view (sketch below)
- assume the conditional $p(x_i \mid x_{<i})$ depends only on a local neighbourhood of earlier entries
- assume the same conditional model (shared filters) applies at every position
- use a deep convolutional NN to get a larger context area
- each layer uses a small mask, but stacking layers expands the effective receptive field
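A minimal NumPy sketch of a PixelCNN-style masked filter: positions at or after the centre (in raster order) are zeroed so the convolution never sees future entries. The type-A/type-B split follows the usual convention (type A also masks the centre); the code is illustrative, not the notes' exact construction.

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Binary k x k mask for an autoregressive (raster-order) convolution filter.
    Zeroes every position below the centre row, and every position to the right
    of the centre in the centre row; type 'A' also zeroes the centre itself."""
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c + 1:] = 0          # same row, to the right of the current pixel
    mask[c + 1:, :] = 0          # all rows below the current pixel
    if mask_type == "A":
        mask[c, c] = 0           # first layer: do not look at the pixel being predicted
    return mask

print(causal_mask(5, "A"))
# The masked filter is applied as weights * mask before convolving,
# so the prediction for a pixel uses only pixels above it and to its left.
```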
Transformer-based AR Models
Energy Based Models
Model
- need a model that can compute the distribution value at any given sample $x$
- properties of a distribution: $p(x) \ge 0$ everywhere, and $\sum_x p(x) = 1$
- Boltzmann distribution: $p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}$
- the denominator is the partition function $Z_\theta = \sum_x e^{-E_\theta(x)}$, which requires $O(C^D)$ operations and is needed for both sampling and training EBMs
- energy-based models use the Boltzmann distribution to convert an (unnormalized) energy function $E_\theta(x)$ into a proper distribution (sketch below)
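A tiny NumPy illustration of this conversion on a discrete space, using a made-up quadratic energy: taking $e^{-E}$ and normalizing yields a valid distribution, but computing the partition function means enumerating all $C^D$ states.

```python
import numpy as np
from itertools import product

C, D = 3, 4                                   # small enough to enumerate all C**D states

def energy(x, W):
    """Made-up quadratic (Boltzmann-machine-style) energy of a discrete state x."""
    x = np.asarray(x, dtype=float)
    return -x @ W @ x

rng = np.random.default_rng(0)
W = rng.normal(size=(D, D))

states = list(product(range(C), repeat=D))    # all C**D = 81 states
energies = np.array([energy(s, W) for s in states])

Z = np.sum(np.exp(-energies))                 # partition function: O(C**D) work
probs = np.exp(-energies) / Z                 # Boltzmann distribution over all states
print(len(states), probs.sum())               # 81, 1.0
```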
Computational
- Boltzmann Machines: simple quadratic energy functions (single linear layer)
- Neural: deep NN as energy function (non-linear)
Training
- empirical expectation: $\mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$ for samples $x^{(i)} \sim p$
- can train an EBM with MLE: $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta E_\theta(x')\right]$, so the partition function never has to be computed explicitly
- the challenge is to sample from the model distribution $p_\theta$ in order to estimate the second expectation (sketch below)
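A sketch of the resulting two-term update in PyTorch, assuming samples from the model are already available (e.g. via the Langevin sampler below); the energy network and both batches are placeholders. Autograd on this surrogate loss reproduces the gradient above (for NLL minimization: positive phase on data, negative phase on model samples).

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # E_theta(x)
opt = torch.optim.Adam(energy.parameters(), lr=1e-3)

x_data = torch.randn(128, 2)     # placeholder data batch
x_model = torch.randn(128, 2)    # placeholder samples from p_theta (e.g. via MCMC/Langevin)

# Minimizing this surrogate gives the MLE gradient:
#   grad NLL = E_data[grad E(x)] - E_model[grad E(x)]
loss = energy(x_data).mean() - energy(x_model).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```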
Sampling
- Markov Chain Monte Carlo (MCMC): each sample depends only on the previous one, and the chain converges to the target distribution over time
  - zero order: use the target distribution (energy values) directly
  - first order: use the gradient $\nabla_x E_\theta(x)$
  - second order: use the Hessian
- Gibbs Sampling: resample one dimension at a time from its conditional, keeping the others fixed
- burn-in period is the time required for the chain's distribution to match the target
Boltzmann Machine
Langevin Dynamics
- samples from $p_\theta(x)$ using only the energy function: $x_{t+1} = x_t - \frac{\epsilon}{2}\nabla_x E_\theta(x_t) + \sqrt{\epsilon}\, z_t$ with $z_t \sim \mathcal{N}(0, I)$ (sketch below)
- conservative sampling starts from a data sample $x_0 \sim p_{data}$ and runs only a reduced number of steps; this shortcut only works for training (as in contrastive divergence)
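A minimal PyTorch sketch of this Langevin update; the energy network, step size, and number of steps are illustrative placeholders.

```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # E_theta(x)

def langevin_sample(n, steps=100, eps=0.01):
    x = torch.randn(n, 2)                                    # start from noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]    # grad_x E_theta(x)
        noise = torch.randn_like(x)
        x = x - 0.5 * eps * grad + (eps ** 0.5) * noise      # Langevin update
    return x.detach()

print(langevin_sample(8))
```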
Flow Based Models
Data Manifold: valid data is concentrated on a narrow, lower-dimensional manifold
Latent Space: a coordinate system for (a transformed version of) the data manifold
Latent Representation: the mapping between data space and latent space
Normalizing Flow
- for a random variable $z \sim p_z$ with $x = f(z)$, the change of variables gives $p_x(x) = p_z(f^{-1}(x)) \left|\frac{d f^{-1}(x)}{dx}\right|$, where f is
  - invertible
  - strictly increasing or decreasing
- vectorized form: $p_x(x) = p_z(f^{-1}(x)) \left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|$; the Jacobian must exist, so f should be differentiable and invertible (numeric check below)
  - to be invertible, z and x must be of the same dimension
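A quick numeric sanity check of the 1-D change-of-variables formula, using a hypothetical affine flow $x = f(z) = 2z + 1$ with a standard Gaussian latent (SciPy is used only for the Gaussian pdf):

```python
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0                       # x = f(z) = a*z + b, invertible and increasing
x = np.linspace(-4, 6, 5)

# change of variables: p_x(x) = p_z(f^{-1}(x)) * |d f^{-1}/dx| = p_z((x - b)/a) / |a|
p_x_formula = norm.pdf((x - b) / a) / abs(a)

# the pushforward of N(0,1) through an affine map is N(b, a^2), so compare directly
p_x_exact = norm.pdf(x, loc=b, scale=abs(a))
print(np.allclose(p_x_formula, p_x_exact))   # True
```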
Flow Learning
- the flow is typically built with deep models (a composition of several mappings)
- each mapping in the composition must be invertible and differentiable
Sampling
- the latent distribution is easy to sample from, e.g. a Gaussian; a data sample is then obtained as $x = f(z)$
Training
- maximize the log-likelihood of the normalizing flow: $\log p_x(x) = \log p_z(f^{-1}(x)) + \log\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|$; the average negative log-likelihood over the data is the MLE risk (sketch below)
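A minimal trainable flow in PyTorch under strong simplifying assumptions: a hypothetical element-wise affine flow $z = f^{-1}(x) = (x - b)\,e^{-s}$, so the log-determinant term is just $-\sum_j s_j$. Real models such as RealNVP, NICE, and Glow use coupling layers instead, but the training loss has exactly this form.

```python
import torch

D = 2
s = torch.zeros(D, requires_grad=True)          # log-scale parameters
b = torch.zeros(D, requires_grad=True)          # shift parameters
opt = torch.optim.Adam([s, b], lr=0.05)

data = torch.randn(1024, D) * 3.0 + 5.0         # toy data the flow should fit

for step in range(1000):
    z = (data - b) * torch.exp(-s)              # z = f^{-1}(x)
    log_pz = -0.5 * (z ** 2 + torch.log(torch.tensor(2 * torch.pi))).sum(dim=1)
    log_det = -s.sum()                          # log|det d f^{-1}/dx| = -sum(s)
    nll = -(log_pz + log_det).mean()            # MLE risk: negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()

print(s.exp().detach(), b.detach())             # should approach scale ~3 and shift ~5
```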
Computational Flow Based Models
- Real-valued Non-Volume Preserving (RealNVP)
  - used for visual generation
  - training is computationally easy: the coupling layers give a triangular Jacobian, so the determinant is cheap to compute
- Non-linear Independent Component Estimation (NICE)
  - simpler flow: easy to train, but not very expressive
- Generative Flow (Glow)
  - uses invertible 1x1 convolutions to combine entries (channels)
  - training needs more computation
  - generation quality is better
flow models are easy to sample from
- faster than AR models
- faster and easier than EBMs
- but still have a high computational cost overall




















