Language Model
Language Model is a learning model which can start from an incomplete text and complete it into a coherent text (correct semantics and syntax)
- semantics: logically correct
- syntax: grammatically correct
Corpus is the body of text used to train a model
Tokenization is the procedure of breaking text into smaller units called tokens
- can be words, sub-words, or characters
- text is represented as a sequence of tokens $(x_1, \dots, x_T)$, where $I$ is the vocabulary size
- by index: $x_t \in \{1, \dots, I\}$
- by one-hot encoding: $x_t \in \{0, 1\}^I$ with a single 1 at the token's index
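A minimal character-level tokenizer sketch in Python; the corpus string and the helper names `encode` / `one_hot` are illustrative, not from the notes:

```python
# Minimal character-level tokenizer: build a vocabulary of size I from a corpus,
# then represent text either by token indices or by one-hot vectors.
corpus = "a language model completes text"
vocab = sorted(set(corpus))          # tokens (here: characters)
I = len(vocab)                       # vocabulary size
tok_to_idx = {t: i for i, t in enumerate(vocab)}

def encode(text):
    """Text -> sequence of token indices."""
    return [tok_to_idx[t] for t in text]

def one_hot(idx):
    """Token index -> one-hot vector of length I."""
    v = [0] * I
    v[idx] = 1
    return v

indices = encode("model")
print(indices)             # list of indices into the vocabulary
print(one_hot(indices[0])) # length-I vector with a single 1
```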
Embedding represents each token index in a vocabulary of size $I$ by a fixed-length vector
- let $W_e \in \mathbb{R}^{E \times I}$ be the embedding matrix and $x_t$ the one-hot token, then $e_t = W_e x_t$ gives the token embedding
- token vector goes from $1 \times I$ to $E \times 1$, i.e. a denser vector since $E$ can be anything (typically $E \ll I$)
- the embedding matrix $W_e$ can be trained
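A sketch of the embedding step, assuming a trainable matrix $W_e$ of shape $E \times I$ as above; multiplying by a one-hot vector is the same as selecting a column:

```python
import numpy as np

I, E = 50, 8                      # vocabulary size, embedding dimension (E freely chosen)
rng = np.random.default_rng(0)
W_e = rng.normal(size=(E, I))     # trainable embedding matrix

x = np.zeros(I); x[7] = 1.0       # one-hot token (index 7), shape (I,)
e = W_e @ x                       # dense embedding, shape (E,)

# Multiplying by a one-hot vector is just selecting a column of W_e:
assert np.allclose(e, W_e[:, 7])
```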
Language Distribution describes the probability of a sequence of tokens
- input: a sequence of tokens $x_1, \dots, x_t$
- dist: $p(x_{t+1} \mid x_1, \dots, x_t)$, the probability of each possible next token
- output: the next token $x_{t+1}$, sampled from the distribution
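A sketch of the input, distribution, output loop; `next_token_dist` is a random placeholder standing in for a trained model:

```python
import numpy as np

def next_token_dist(tokens, I=50, seed=0):
    """Placeholder model: returns p(x_{t+1} | x_1..x_t) over I tokens."""
    rng = np.random.default_rng(seed + sum(tokens))
    logits = rng.normal(size=I)
    p = np.exp(logits - logits.max())
    return p / p.sum()

tokens = [3, 17, 42]                 # input: sequence of token indices
p = next_token_dist(tokens)          # dist: probabilities over the vocabulary
nxt = int(np.random.default_rng(1).choice(len(p), p=p))  # output: sampled next token
tokens.append(nxt)
```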
Building a Language Model
Data: select and tokenize a corpus
Sample: randomly choose a sequence of $T$ consecutive tokens $(x_{s+1}, \dots, x_{s+T})$
Label: set the label to the next token(s) $(x_{s+2}, \dots, x_{s+T+1})$
Batch: take multiple samples to form a batch
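A sketch of the sampling and batching steps, assuming the corpus has already been tokenized into integer ids; `make_batch` and the sizes are illustrative:

```python
import numpy as np

def make_batch(token_ids, T, batch_size, rng):
    """Sample `batch_size` windows of length T; labels are the same windows shifted by one token."""
    token_ids = np.asarray(token_ids)
    starts = rng.integers(0, len(token_ids) - T - 1, size=batch_size)
    inputs = np.stack([token_ids[s:s + T] for s in starts])          # (B, T)
    labels = np.stack([token_ids[s + 1:s + T + 1] for s in starts])  # (B, T), the next tokens
    return inputs, labels

rng = np.random.default_rng(0)
corpus_ids = rng.integers(0, 50, size=10_000)   # stand-in for a tokenized corpus
X, Y = make_batch(corpus_ids, T=16, batch_size=4, rng=rng)
```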
Maximum Likelihood Learning
Likelihood is the probability of the model being correct, $p_\theta(x_{t+1} \mid x_1, \dots, x_t)$
- over a batch the likelihood is given by, $L(\theta) = \prod_{b=1}^{B} \prod_{t=1}^{T} p_\theta\big(x^{(b)}_{t+1} \mid x^{(b)}_1, \dots, x^{(b)}_t\big)$
- working with the log-likelihood is easier, $\log L(\theta) = \sum_{b=1}^{B} \sum_{t=1}^{T} \log p_\theta\big(x^{(b)}_{t+1} \mid x^{(b)}_1, \dots, x^{(b)}_t\big)$
- likelihood can be maximized by minimizing the cross-entropy, $\mathcal{L}_{CE}(\theta) = -\frac{1}{BT} \log L(\theta)$
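A sketch of the cross-entropy objective on a batch, assuming the model outputs a probability distribution over the vocabulary at every position:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Negative mean log-likelihood of the correct next tokens.

    probs:  (B, T, I) predicted next-token distributions
    labels: (B, T)    indices of the true next tokens
    """
    B, T, _ = probs.shape
    picked = probs[np.arange(B)[:, None], np.arange(T)[None, :], labels]
    return -np.mean(np.log(picked + 1e-12))

# Minimizing this loss over batches is equivalent to maximizing the (log-)likelihood.
```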
Simple Language Models
Bigram Model
- token $x_t$ is used to predict $x_{t+1}$
- often gives wrong or repetitive outputs
- very limited since there is no memory
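A count-based bigram sketch (the add-one smoothing is an assumption, not from the notes); it shows why outputs get repetitive: only the last token is ever used:

```python
import numpy as np

def fit_bigram(token_ids, I):
    """p(x_{t+1} = j | x_t = i) estimated from co-occurrence counts in the corpus."""
    counts = np.ones((I, I))               # add-one smoothing so no pair has zero probability
    for a, b in zip(token_ids[:-1], token_ids[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample_bigram(P, start, length, rng):
    out = [start]
    for _ in range(length):
        # only the last token is used to predict the next one (no memory)
        out.append(int(rng.choice(P.shape[1], p=P[out[-1]])))
    return out
```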
Context Aware Model
- context $c_t$, the sum of all tokens up to $x_t$, is used to predict $x_{t+1}$
- lacks sequential order (a sum ignores token positions)
Recurrent Language Model
- input: current token $x_t$ and previous context $c_{t-1}$
- output: new context $c_t$ and next-token distribution
- context is learned sequentially, so training and generation are very slow
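A sketch of one recurrent step under the notation above ($x_t$, context $c$); the weight shapes and the `tanh` nonlinearity are assumptions:

```python
import numpy as np

class TinyRNNLM:
    """Recurrent LM sketch: each step consumes the current token and the previous context,
    and emits a new context plus a next-token distribution."""
    def __init__(self, I, E, H, seed=0):
        r = np.random.default_rng(seed)
        self.W_e = r.normal(0, 0.1, (E, I))   # embedding
        self.W_h = r.normal(0, 0.1, (H, H))   # context -> context
        self.W_x = r.normal(0, 0.1, (H, E))   # token   -> context
        self.W_o = r.normal(0, 0.1, (I, H))   # context -> logits

    def step(self, token_idx, c_prev):
        e = self.W_e[:, token_idx]
        c = np.tanh(self.W_x @ e + self.W_h @ c_prev)      # new context
        logits = self.W_o @ c
        p = np.exp(logits - logits.max()); p /= p.sum()    # next-token distribution
        return c, p

lm = TinyRNNLM(I=50, E=16, H=32)
c = np.zeros(32)
c, p = lm.step(3, c)   # contexts must be computed one step after another: slow
```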
Transformer-based Language Models
Transformers provide an efficient way to process sequences by using Attention, which maps each token to three embeddings
| Embedding | Formula | Description |
| --- | --- | --- |
| query $q_t$ | $q_t = W_q e_t$ | what token $t$ is looking for in other tokens |
| key $k_t$ | $k_t = W_k e_t$ | what token $t$ offers to be matched against |
| value $v_t$ | $v_t = W_v e_t$ | the content token $t$ contributes to the context |
Attention Score is the inner product between the query of token $t$ and the key of token $s$, $a_{ts} = q_t^\top k_s$
- compute attention weights with softmax, $\alpha_{ts} = \mathrm{softmax}_s(a_{ts})$, and form the new context as $c_t = \sum_{s \le t} \alpha_{ts} v_s$
Sequential Order is captured by adding positional encoding to the token embeddings, $e_t \leftarrow e_t + p_t$, where $p_t$ depends only on the position $t$
Parallel Processing
- computation can be broken into $T$ parallel processes, one per position
- Masked Decoding is used to ensure the same number of computations for each time step by setting the inner product to $-\infty$ for extra keys (positions $s > t$), so their softmax weights become 0
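A single-head masked attention sketch tying these pieces together; the $1/\sqrt{d}$ scaling is the standard scaled dot-product variant and an assumption beyond the notes:

```python
import numpy as np

def masked_self_attention(E_tok, W_q, W_k, W_v):
    """Single-head masked (causal) attention over token embeddings E_tok of shape (T, E).

    Each position attends only to itself and earlier positions; future keys get a
    score of -inf so their softmax weight is exactly 0.
    """
    Q, K, V = E_tok @ W_q, E_tok @ W_k, E_tok @ W_v          # queries, keys, values: (T, d)
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                            # attention scores, (T, T)
    T = scores.shape[0]
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over the allowed keys
    return weights @ V                                        # (T, d): all rows computed in parallel

rng = np.random.default_rng(0)
T, E_dim, d = 5, 8, 4
E_tok = rng.normal(size=(T, E_dim))                           # token (+ positional) embeddings
out = masked_self_attention(E_tok, *(rng.normal(size=(E_dim, d)) for _ in range(3)))
```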
Large Language Models
Perplexity $= \exp\!\big(-\tfrac{1}{T} \sum_{t} \log p_\theta(x_{t+1} \mid x_1, \dots, x_t)\big)$ measures how well the model predicts a held-out sequence
- smaller perplexity = larger likelihood = better model
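A small sketch of perplexity as the exponential of the average negative log-likelihood:

```python
import numpy as np

def perplexity(log_probs):
    """log_probs: log p(x_{t+1} | x_1..x_t) for each position of a held-out sequence."""
    return float(np.exp(-np.mean(log_probs)))

# A uniform model over I = 50 tokens has perplexity 50; anything lower is an improvement.
print(perplexity(np.log(np.full(100, 1 / 50))))   # -> 50.0
```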
Pre-Training gives a general language distribution
Fine-Tuning gives a task-specific distribution
- <str>, <delimiter>, and <end> tokens for sampling answers
- supervised fine tuning: the model is trained for additional epochs on samples from the task-specific distribution
- full fine tuning: update all layers
- selective fine tuning: update specific layers
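A sketch of full vs. selective fine-tuning in PyTorch, using a stand-in model; the only difference is which parameters keep `requires_grad = True`:

```python
import torch.nn as nn

# Stand-in for a pre-trained LM: embedding, two hidden layers, output head.
model = nn.Sequential(
    nn.Embedding(50, 32),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 50),
)

# Full fine-tuning: every parameter stays trainable (the default).
# Selective fine-tuning: freeze everything, then re-enable only the last layer.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)   # only the output head's weights and bias remain trainable
```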
Low Rank Adaptation (LoRA)
- weight update $W = W_0 + BA$, with $W_0 \in \mathbb{R}^{d \times k}$ fixed and $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ learnable
- $r(d + k)$ learnable updates vs. $dk$ updates for the full matrix; since $r \ll \min(d, k)$, LoRA is much lower
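A LoRA-style layer sketch with frozen $W_0$ and low-rank factors $A$, $B$ (the zero initialization of $B$ is an assumption so the layer starts out unchanged):

```python
import numpy as np

class LoRALinear:
    """W = W0 + B @ A with W0 frozen; only A (r x k) and B (d x r) are trained."""
    def __init__(self, W0, r, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                          # frozen pre-trained weights, d*k values
        self.A = rng.normal(0, 0.01, (r, k))  # learnable, r*k values
        self.B = np.zeros((d, r))             # learnable, d*r values

    def __call__(self, x):
        return self.W0 @ x + self.B @ (self.A @ x)

d, k, r = 1024, 1024, 8
layer = LoRALinear(np.random.default_rng(1).normal(size=(d, k)), r)
print(r * (d + k), "learnable values vs.", d * k, "for a full update")
```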
Prompt Design
- Few-shot learning: performs the task after being shown a few labeled examples in the prompt
- Zero-shot learning: performs the task without any examples
- Formulation (generic framework)
  - encoding: turn the task input into a prompt for the LM
  - decoding: extract the task output from the LM completion
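A sketch of the encoding/decoding framework for a hypothetical translation task; the prompt format, example strings, and helper names are illustrative:

```python
def encode_prompt(task_input, examples=()):
    """Turn the task input into a prompt; few-shot if `examples` is non-empty, zero-shot otherwise."""
    lines = ["Translate English to French."]
    for src, tgt in examples:                        # labeled examples shown in the prompt
        lines.append(f"English: {src}\nFrench: {tgt}")
    lines.append(f"English: {task_input}\nFrench:")  # the actual task input
    return "\n\n".join(lines)

def decode_completion(completion):
    """Extract the task output from the LM's completion."""
    return completion.strip().splitlines()[0]

prompt = encode_prompt("cheese", examples=[("sea otter", "loutre de mer")])
# `prompt` would be sent to the LM; decode_completion parses whatever comes back
```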
Foundation Model refers to an adaptable pre-trained model that can be specialized to many downstream tasks (e.g. via fine-tuning or prompt design)













