Language Model
Language Model is a learning model which can start from an incomplete text and complete it into a coherent text (correct semantics and syntax)
- semantics: logically correct
- syntax: grammatically correct
Corpus is the body of text used to train a model
Tokenization is the procedure of breaking text into smaller units called tokens
- can be words, sub-words, or characters
- text is represented as a sequence of tokens $(x_1, \dots, x_T)$, where $I$ is the vocabulary size
- by index: $x_t \in \{1, \dots, I\}$
- by one-hot encoding: $x_t \in \{0, 1\}^I$ with a single 1 at the token's index
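A minimal character-level tokenizer sketch in Python; the corpus string and the helper names `encode` / `one_hot` are illustrative, not from the notes:

```python
# Minimal character-level tokenizer: build a vocabulary of size I from a corpus,
# then represent text either by token indices or by one-hot vectors.
corpus = "a language model completes text"
vocab = sorted(set(corpus))          # tokens (here: characters)
I = len(vocab)                       # vocabulary size
tok_to_idx = {t: i for i, t in enumerate(vocab)}

def encode(text):
    """Text -> sequence of token indices."""
    return [tok_to_idx[t] for t in text]

def one_hot(idx):
    """Token index -> one-hot vector of length I."""
    v = [0] * I
    v[idx] = 1
    return v

indices = encode("model")
print(indices)             # list of indices into the vocabulary
print(one_hot(indices[0])) # length-I vector with a single 1
```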
Embedding represents each token index in a vocabulary of size $I$ by a fixed-length vector
- let $W_e \in \mathbb{R}^{E \times I}$ be the embedding matrix and $x_t$ the one-hot token, then $e_t = W_e x_t$ gives the token embedding
- token vector goes from $1 \times I$ to $E \times 1$, i.e. a denser vector since $E$ can be anything (typically $E \ll I$)
- the embedding matrix $W_e$ can be trained
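A sketch of the embedding step, assuming a trainable matrix $W_e$ of shape $E \times I$ as above; multiplying by a one-hot vector is the same as selecting a column:

```python
import numpy as np

I, E = 50, 8                      # vocabulary size, embedding dimension (E freely chosen)
rng = np.random.default_rng(0)
W_e = rng.normal(size=(E, I))     # trainable embedding matrix

x = np.zeros(I); x[7] = 1.0       # one-hot token (index 7), shape (I,)
e = W_e @ x                       # dense embedding, shape (E,)

# Multiplying by a one-hot vector is just selecting a column of W_e:
assert np.allclose(e, W_e[:, 7])
```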
Language Distribution describes the probability of a sequence of tokens
- input: a sequence of tokens $x_1, \dots, x_t$
- dist: $p(x_{t+1} \mid x_1, \dots, x_t)$, the probability of each possible next token
- output: the next token $x_{t+1}$, sampled from the distribution
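A sketch of the input, distribution, output loop; `next_token_dist` is a random placeholder standing in for a trained model:

```python
import numpy as np

def next_token_dist(tokens, I=50, seed=0):
    """Placeholder model: returns p(x_{t+1} | x_1..x_t) over I tokens."""
    rng = np.random.default_rng(seed + sum(tokens))
    logits = rng.normal(size=I)
    p = np.exp(logits - logits.max())
    return p / p.sum()

tokens = [3, 17, 42]                 # input: sequence of token indices
p = next_token_dist(tokens)          # dist: probabilities over the vocabulary
nxt = int(np.random.default_rng(1).choice(len(p), p=p))  # output: sampled next token
tokens.append(nxt)
```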
Building a Language Model
Data: select and tokenize a corpus
Sample: randomly choose a sequence of $T$ consecutive tokens $(x_{s+1}, \dots, x_{s+T})$
Label: set the label to the next token(s) $(x_{s+2}, \dots, x_{s+T+1})$
Batch: take multiple samples to form a batch
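A sketch of the sampling and batching steps, assuming the corpus has already been tokenized into integer ids; `make_batch` and the sizes are illustrative:

```python
import numpy as np

def make_batch(token_ids, T, batch_size, rng):
    """Sample `batch_size` windows of length T; labels are the same windows shifted by one token."""
    token_ids = np.asarray(token_ids)
    starts = rng.integers(0, len(token_ids) - T - 1, size=batch_size)
    inputs = np.stack([token_ids[s:s + T] for s in starts])          # (B, T)
    labels = np.stack([token_ids[s + 1:s + T + 1] for s in starts])  # (B, T), the next tokens
    return inputs, labels

rng = np.random.default_rng(0)
corpus_ids = rng.integers(0, 50, size=10_000)   # stand-in for a tokenized corpus
X, Y = make_batch(corpus_ids, T=16, batch_size=4, rng=rng)
```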
Maximum Likelihood Learning
Likelihood is the probability of the model being correct, $p_\theta(x_{t+1} \mid x_1, \dots, x_t)$
- over a batch the likelihood is given by, $L(\theta) = \prod_{b=1}^{B} \prod_{t=1}^{T} p_\theta\big(x^{(b)}_{t+1} \mid x^{(b)}_1, \dots, x^{(b)}_t\big)$
- working with the log-likelihood is easier, $\log L(\theta) = \sum_{b=1}^{B} \sum_{t=1}^{T} \log p_\theta\big(x^{(b)}_{t+1} \mid x^{(b)}_1, \dots, x^{(b)}_t\big)$
- likelihood can be maximized by minimizing the cross-entropy, $\mathcal{L}_{CE}(\theta) = -\frac{1}{BT} \log L(\theta)$
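A sketch of the cross-entropy objective on a batch, assuming the model outputs a probability distribution over the vocabulary at every position:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Negative mean log-likelihood of the correct next tokens.

    probs:  (B, T, I) predicted next-token distributions
    labels: (B, T)    indices of the true next tokens
    """
    B, T, _ = probs.shape
    picked = probs[np.arange(B)[:, None], np.arange(T)[None, :], labels]
    return -np.mean(np.log(picked + 1e-12))

# Minimizing this loss over batches is equivalent to maximizing the (log-)likelihood.
```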
Simple Language Models
Bigram Model
- token $x_t$ is used to predict $x_{t+1}$
- often gives wrong or repetitive outputs
- very limited since there is no memory
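A count-based bigram sketch (the add-one smoothing is an assumption, not from the notes); it shows why outputs get repetitive: only the last token is ever used:

```python
import numpy as np

def fit_bigram(token_ids, I):
    """p(x_{t+1} = j | x_t = i) estimated from co-occurrence counts in the corpus."""
    counts = np.ones((I, I))               # add-one smoothing so no pair has zero probability
    for a, b in zip(token_ids[:-1], token_ids[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample_bigram(P, start, length, rng):
    out = [start]
    for _ in range(length):
        # only the last token is used to predict the next one (no memory)
        out.append(int(rng.choice(P.shape[1], p=P[out[-1]])))
    return out
```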
Context Aware Model
- context $c_t$, the sum of all tokens up to $x_t$, is used to predict $x_{t+1}$
- lacks sequential order (a sum ignores token positions)
Recurrent Language Model
- input: current token $x_t$ and previous context $c_{t-1}$
- output: new context $c_t$ and next-token distribution
- context is learned sequentially, so training and generation are very slow
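A sketch of one recurrent step under the notation above ($x_t$, context $c$); the weight shapes and the `tanh` nonlinearity are assumptions:

```python
import numpy as np

class TinyRNNLM:
    """Recurrent LM sketch: each step consumes the current token and the previous context,
    and emits a new context plus a next-token distribution."""
    def __init__(self, I, E, H, seed=0):
        r = np.random.default_rng(seed)
        self.W_e = r.normal(0, 0.1, (E, I))   # embedding
        self.W_h = r.normal(0, 0.1, (H, H))   # context -> context
        self.W_x = r.normal(0, 0.1, (H, E))   # token   -> context
        self.W_o = r.normal(0, 0.1, (I, H))   # context -> logits

    def step(self, token_idx, c_prev):
        e = self.W_e[:, token_idx]
        c = np.tanh(self.W_x @ e + self.W_h @ c_prev)      # new context
        logits = self.W_o @ c
        p = np.exp(logits - logits.max()); p /= p.sum()    # next-token distribution
        return c, p

lm = TinyRNNLM(I=50, E=16, H=32)
c = np.zeros(32)
c, p = lm.step(3, c)   # contexts must be computed one step after another: slow
```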
Transformer-based Language Models
Transformers provide an efficient way to process sequences by using Attention, which maps each token to three embeddings
| Embedding | Formula | Description |
| --- | --- | --- |
| query $q_t$ | $q_t = W_q e_t$ | what token $t$ is looking for in other tokens |
| key $k_t$ | $k_t = W_k e_t$ | what token $t$ offers to be matched against |
| value $v_t$ | $v_t = W_v e_t$ | the content token $t$ contributes to the context |
Attention Score is the inner product between the query of token $t$ and the key of token $s$, $a_{ts} = q_t^\top k_s$
- compute attention weights with softmax, $\alpha_{ts} = \mathrm{softmax}_s(a_{ts})$, and form the new context as $c_t = \sum_{s \le t} \alpha_{ts} v_s$
Sequential Order is captured by adding positional encoding to the token embeddings, $e_t \leftarrow e_t + p_t$, where $p_t$ depends only on the position $t$
Parallel Processing
- computation can be broken into $T$ parallel processes, one per position
- Masked Decoding is used to ensure the same number of computations for each time step by setting the inner product to $-\infty$ for extra keys (positions $s > t$), so their softmax weights become 0
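A single-head masked attention sketch tying these pieces together; the $1/\sqrt{d}$ scaling is the standard scaled dot-product variant and an assumption beyond the notes:

```python
import numpy as np

def masked_self_attention(E_tok, W_q, W_k, W_v):
    """Single-head masked (causal) attention over token embeddings E_tok of shape (T, E).

    Each position attends only to itself and earlier positions; future keys get a
    score of -inf so their softmax weight is exactly 0.
    """
    Q, K, V = E_tok @ W_q, E_tok @ W_k, E_tok @ W_v          # queries, keys, values: (T, d)
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                            # attention scores, (T, T)
    T = scores.shape[0]
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over the allowed keys
    return weights @ V                                        # (T, d): all rows computed in parallel

rng = np.random.default_rng(0)
T, E_dim, d = 5, 8, 4
E_tok = rng.normal(size=(T, E_dim))                           # token (+ positional) embeddings
out = masked_self_attention(E_tok, *(rng.normal(size=(E_dim, d)) for _ in range(3)))
```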
Large Language Models
Perplexity $= \exp\!\big(-\tfrac{1}{T} \sum_{t} \log p_\theta(x_{t+1} \mid x_1, \dots, x_t)\big)$ measures how well the model predicts a held-out sequence
- smaller perplexity = larger likelihood = better model
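A small sketch of perplexity as the exponential of the average negative log-likelihood:

```python
import numpy as np

def perplexity(log_probs):
    """log_probs: log p(x_{t+1} | x_1..x_t) for each position of a held-out sequence."""
    return float(np.exp(-np.mean(log_probs)))

# A uniform model over I = 50 tokens has perplexity 50; anything lower is an improvement.
print(perplexity(np.log(np.full(100, 1 / 50))))   # -> 50.0
```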
Pre-Training gives a general language distribution
Fine-Tuning gives a task-specific distribution
- <str>, <delimiter>, and <end> tokens for sampling answers
- supervised fine tuning: the model is trained for additional epochs on samples from the task-specific distribution
- full fine tuning: update all layers
- selective fine tuning: update specific layers
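A sketch of full vs. selective fine-tuning in PyTorch, using a stand-in model; the only difference is which parameters keep `requires_grad = True`:

```python
import torch.nn as nn

# Stand-in for a pre-trained LM: embedding, two hidden layers, output head.
model = nn.Sequential(
    nn.Embedding(50, 32),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 50),
)

# Full fine-tuning: every parameter stays trainable (the default).
# Selective fine-tuning: freeze everything, then re-enable only the last layer.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)   # only the output head's weights and bias remain trainable
```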
Low Rank Adaptation (LoRA)
- weight update $W = W_0 + BA$, with $W_0 \in \mathbb{R}^{d \times k}$ fixed and $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ learnable
- $r(d + k)$ learnable updates vs. $dk$ updates for the full matrix; since $r \ll \min(d, k)$, LoRA is much lower
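A LoRA-style layer sketch with frozen $W_0$ and low-rank factors $A$, $B$ (the zero initialization of $B$ is an assumption so the layer starts out unchanged):

```python
import numpy as np

class LoRALinear:
    """W = W0 + B @ A with W0 frozen; only A (r x k) and B (d x r) are trained."""
    def __init__(self, W0, r, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                          # frozen pre-trained weights, d*k values
        self.A = rng.normal(0, 0.01, (r, k))  # learnable, r*k values
        self.B = np.zeros((d, r))             # learnable, d*r values

    def __call__(self, x):
        return self.W0 @ x + self.B @ (self.A @ x)

d, k, r = 1024, 1024, 8
layer = LoRALinear(np.random.default_rng(1).normal(size=(d, k)), r)
print(r * (d + k), "learnable values vs.", d * k, "for a full update")
```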
Prompt Design
- Few-shot learning: performs the task after being shown a few labeled examples in the prompt
- Zero-shot learning: performs the task without any examples
- Formulation (generic framework)
  - encoding: turn the task input into a prompt for the LM
  - decoding: extract the task output from the LM completion
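A sketch of the encoding/decoding framework for a hypothetical translation task; the prompt format, example strings, and helper names are illustrative:

```python
def encode_prompt(task_input, examples=()):
    """Turn the task input into a prompt; few-shot if `examples` is non-empty, zero-shot otherwise."""
    lines = ["Translate English to French."]
    for src, tgt in examples:                        # labeled examples shown in the prompt
        lines.append(f"English: {src}\nFrench: {tgt}")
    lines.append(f"English: {task_input}\nFrench:")  # the actual task input
    return "\n\n".join(lines)

def decode_completion(completion):
    """Extract the task output from the LM's completion."""
    return completion.strip().splitlines()[0]

prompt = encode_prompt("cheese", examples=[("sea otter", "loutre de mer")])
# `prompt` would be sent to the LM; decode_completion parses whatever comes back
```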
Foundation Model refers to an adaptable pre-trained model that can be specialized to many downstream tasks (e.g. via fine-tuning or prompt design)













