Language Model

A Language Model is a learning model which can start from an incomplete text and complete it into a coherent text (correct semantics and syntax)

  • semantics: logically correct
  • syntax: grammatically correct
learning model → a parametric model $f_{\mathbf{w}}(\cdot)$ where the parameters $\mathbf{w}$ can be tuned

Corpus is the body of text used to train a model

Tokenization is the procedure of breaking text into smaller units called tokens

  • can be words, sub-words, or characters
  • text is represented as a sequence of tokens $x_1, x_2, \dots$, where $I$ is the vocabulary size
    • by index: $x_t \in \{1, \dots, I\}$
    • by one-hot encoding: $\mathbf{x}_t \in \{0, 1\}^{I}$, with a single 1 at position $x_t$

Embedding represents each token index in a vocabulary of size I by a fixed vector

  • let $\mathbf{E} \in \mathbb{R}^{E \times I}$ and $\mathbf{x}_t \in \{0,1\}^{I}$ (one-hot), then $\mathbf{e}_t = \mathbf{E}\mathbf{x}_t$ gives the embedding of the token
  • token vector goes from $1 \times I$ to $E \times 1$, i.e. a denser vector, since $E$ can be chosen freely (typically $E \ll I$)
  • E can be trained

Language Distribution describes the probability of a sequence of tokens

a language model completes a sequence of tokens by sampling from a language distribution
  • input: $x_1, \dots, x_t$
  • dist: $p(x_{t+1} \mid x_1, \dots, x_t)$
  • output: $x_{t+1} \sim p(x_{t+1} \mid x_1, \dots, x_t)$
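
A minimal sketch of this sampling loop, assuming a hypothetical `model` callable that returns the probability vector $p(x_{t+1} \mid x_1, \dots, x_t)$ over the vocabulary:

```python
import torch

def complete(model, tokens, num_new_tokens):
    """Autoregressive completion: repeatedly sample x_{t+1} from the language
    distribution and append it. `model` is a hypothetical callable mapping a
    token sequence to a probability vector of shape (I,)."""
    tokens = list(tokens)
    for _ in range(num_new_tokens):
        probs = model(torch.tensor(tokens))              # p(x_{t+1} | x_1..x_t)
        next_token = torch.multinomial(probs, 1).item()  # sample one token index
        tokens.append(next_token)
    return tokens
```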

Building a Language Model

Data: select and tokenize a corpus ($x_1$ to $x_N$)

Sample: randomly choose a sequence of size $T$ ($x_n$ to $x_{n+T-1}$)
Label: set the label to the next token(s) ($x_{n+1}$ to $x_{n+T}$)

Batch: take multiple samples to form a batch
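
A sketch of the Data/Sample/Label/Batch steps, assuming `corpus_tokens` is the tokenized corpus as a list of token indices (names are illustrative):

```python
import random
import torch

def make_batch(corpus_tokens, T, batch_size):
    """Sample `batch_size` windows of length T; the label for each position is
    the next token in the corpus (teacher forcing)."""
    inputs, labels = [], []
    for _ in range(batch_size):
        n = random.randint(0, len(corpus_tokens) - T - 1)
        inputs.append(corpus_tokens[n : n + T])          # x_n .. x_{n+T-1}
        labels.append(corpus_tokens[n + 1 : n + T + 1])  # x_{n+1} .. x_{n+T}
    return torch.tensor(inputs), torch.tensor(labels)
```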

Maximum Likelihood Learning

  • Likelihood is the probability the model assigns to the correct labels under its parameters $\mathbf{w}$

    • over a batch of $B$ samples the likelihood is given by $\mathcal{L}(\mathbf{w}) = \prod_{b=1}^{B} p_{\mathbf{w}}(y_b \mid x_{b,1}, \dots, x_{b,T})$

    • working with the log-likelihood is easier: $\log \mathcal{L}(\mathbf{w}) = \sum_{b=1}^{B} \log p_{\mathbf{w}}(y_b \mid x_{b,1}, \dots, x_{b,T})$

    • the likelihood can be maximized by minimizing the cross-entropy, $\hat{R}(\mathbf{w}) = -\frac{1}{B} \sum_{b=1}^{B} \log p_{\mathbf{w}}(y_b \mid x_{b,1}, \dots, x_{b,T})$
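
A sketch of one maximum-likelihood (cross-entropy) training step in PyTorch; `model`, `optimizer`, and the tensor shapes are assumptions, not the course's exact setup:

```python
import torch
import torch.nn.functional as F

def mle_step(model, optimizer, inputs, labels):
    """One step of maximum-likelihood learning: minimizing cross-entropy over the
    batch maximizes the log-likelihood of the correct next tokens.
    `logits` has shape (B, T, I) and `labels` shape (B, T)."""
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```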

Simple Language Models

Bigram Model

  • token $x_t$ is used to predict $x_{t+1}$
  • often gives wrong or repetitive outputs
  • very limited since there is no memory

00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 1.png
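
A minimal PyTorch sketch of a bigram model, where each token indexes a row of logits for the next token (the class name is illustrative):

```python
import torch
import torch.nn as nn

class BigramLM(nn.Module):
    """Each token predicts the next one directly from a learned I x I table."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)  # row x_t = logits for x_{t+1}

    def forward(self, tokens):          # tokens: (B, T)
        return self.table(tokens)       # logits: (B, T, I)
```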

Context Aware Model

  • context $c_t$, the sum of all token embeddings up to $x_t$, is used to predict $x_{t+1}$
  • lacks sequential order, since the sum is invariant to the order of the tokens

00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 2.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 3.png
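
A rough sketch of such a context-aware model, using a running sum of embeddings as the context (naming and architecture are assumptions):

```python
import torch
import torch.nn as nn

class BagOfTokensLM(nn.Module):
    """Predicts x_{t+1} from the sum of all embeddings up to x_t.
    The sum is order-invariant, which is why sequential order is lost."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.head = nn.Linear(emb_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (B, T)
        e = self.embed(tokens)                 # (B, T, E)
        context = torch.cumsum(e, dim=1)       # running sum of embeddings up to t
        return self.head(context)              # logits: (B, T, I)
```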

Recurrent Language Model

  • input: token $x_t$ and previous context $c_{t-1}$
  • output: new context $c_t$ and next-token distribution $p(x_{t+1} \mid c_t)$
  • context is computed sequentially, so training is very slow

00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 4.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 5.png
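
A minimal recurrent language model sketch using a GRU; the layer choices are illustrative, not the lecture's exact architecture:

```python
import torch
import torch.nn as nn

class RecurrentLM(nn.Module):
    """Consumes tokens one step at a time, carrying the context in a hidden state."""
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, context=None):   # tokens: (B, T)
        e = self.embed(tokens)                  # (B, T, E)
        out, context = self.rnn(e, context)     # per-step states + new context
        return self.head(out), context          # next-token logits, updated context
```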

Transformer-based Language Models

Transformers provide an efficient way to process sequences by using Attention, which splits each token into three embeddings

| Embedding | Symbol | Description |
| --- | --- | --- |
| query | $\mathbf{q}_t = \mathbf{W}_Q \mathbf{e}_t$ | type of context desired |
| key | $\mathbf{k}_t = \mathbf{W}_K \mathbf{e}_t$ | type of context offered |
| value | $\mathbf{v}_t = \mathbf{W}_V \mathbf{e}_t$ | content of the context |

Attention Score $s_{t,\tau} = \mathbf{q}_t^{\top} \mathbf{k}_\tau$ identifies the relevance of the context at position $\tau$ to the query at position $t$

00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 6.png

compute attention weights with softmax, $\alpha_{t,\tau} = \mathrm{softmax}_\tau(s_{t,\tau})$, and take the weighted sum of values, $\mathbf{c}_t = \sum_{\tau} \alpha_{t,\tau} \mathbf{v}_\tau$, as the context for token $t$
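
A sketch of (scaled) dot-product attention implementing the score/softmax/weighted-sum steps above; the $1/\sqrt{d}$ scaling is the standard convention and an assumption here:

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention: scores measure the relevance of each key to
    each query, softmax turns scores into weights, and the output is a weighted
    sum of the values."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (T, T) attention scores
    weights = torch.softmax(scores, dim=-1)                   # rows sum to 1
    return weights @ V                                        # context for each query
```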

Sequential Order is captured by adding positional encoding, $\tilde{\mathbf{e}}_t = \mathbf{e}_t + \mathbf{p}_t$, where $\mathbf{p}_t$ encodes position $t$

Parallel Processing

  • computation can be broken into $T$ parallel processes, one per position
  • Masked Decoding is used to ensure the same number of computations for each time step by setting the inner product to $-\infty$ for the extra (future) keys
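
A sketch of masked (causal) decoding: future keys get a score of $-\infty$ so their softmax weight is zero (shapes and names are illustrative):

```python
import math
import torch

def masked_attention(Q, K, V):
    """Causal attention: scores for future positions are set to -inf so that their
    softmax weight is zero and every position does the same amount of computation."""
    T = Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # future keys
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V
```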

00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 7.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 8.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 9.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 10.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 11.png

00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 12.png
00 - Meta/01 - Attachments/ECE1508 L1 - Text Generation via Language Models 13.png

Large Language Models

Perplexity is the exponentiated average negative log-likelihood over held-out text, $\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{n=1}^{N} \log p_{\mathbf{w}}(x_n \mid x_{<n})\right)$

  • smaller perplexity = larger likelihood = better model
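
A sketch of computing perplexity as the exponential of the average cross-entropy (tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def perplexity(logits, labels):
    """Perplexity = exp(average negative log-likelihood); smaller means the model
    assigns higher likelihood to the held-out text.
    `logits`: (B, T, I), `labels`: (B, T)."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    return torch.exp(nll).item()
```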

Pre-Training gives general language distribution
Fine-Tuning gives task-specific distribution

  • <str>, <delimiter>, and <end> tokens for sampling answers
  • supervised fine-tuning: the model is trained for additional epochs with samples from the task-specific distribution
  • full fine tuning: update all layers
  • selective fine tuning: update specific layers
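
A sketch of selective fine-tuning in PyTorch: freeze everything, then unfreeze only the chosen layers; the parameter-name prefix and learning rate are illustrative:

```python
import torch

def prepare_selective_finetuning(model, trainable_prefixes=("head",)):
    """Freeze all weights, then unfreeze parameters whose names start with the
    given prefixes (e.g. only the output head), and build an optimizer over them."""
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(trainable_prefixes)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)
```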

Low Rank Adaptation (LoRA)

  • $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$, fixed
  • $\mathbf{A} \in \mathbb{R}^{r \times k}$ and $\mathbf{B} \in \mathbb{R}^{d \times r}$, learnable, with $\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$
  • $dk$ updates vs. $r(d + k)$ updates, LoRA is much lower since $r \ll \min(d, k)$
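
A minimal LoRA-style wrapper around a frozen `nn.Linear`; the rank `r`, scaling `alpha`, and initialization are common choices, assumed here rather than taken from the lecture:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W0 is frozen; only the low-rank factors A (r x k) and B (d x r) are trained,
    so the update has r*(d+k) parameters instead of d*k."""
    def __init__(self, linear: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad = False                       # W0 stays fixed
        d, k = linear.out_features, linear.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # learnable
        self.B = nn.Parameter(torch.zeros(d, r))          # learnable, starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```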

Prompt Design

  • Few-shot learning: performs a task after being shown a few labeled examples

  • Zero-shot learning: performs a task without any examples

  • Formulation (generic framework)

    • encoding: turn the task input into a prompt for the LM
    • decoding: extract the task output from the LM completion (see the sketch at the end of this section)
  • Foundation Model refers to a pre-trained model that is adaptable to many downstream tasks
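
A sketch of the encoding/decoding formulation for a few-shot prompt; the task (sentiment classification) and all names are illustrative:

```python
def encode(review, examples):
    """Encoding: turn the task input into a prompt, prepending a few labeled shots."""
    shots = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
    return f"{shots}\nReview: {review}\nSentiment:"

def decode(completion):
    """Decoding: extract the task output from the LM completion."""
    return completion.strip().split()[0]   # first word of the completion is the label

prompt = encode(
    "The plot was dull.",
    [("A wonderful film.", "positive"), ("Terrible acting.", "negative")],
)
# label = decode(language_model(prompt))   # `language_model` is a hypothetical callable
```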