EXTENDED AI CONCEPTS — VOLUME II

Memory, Reasoning
& Production

12 advanced concepts — Memory architectures, prompting strategies, agent planning, alignment, MoE, multimodal RAG, and LLMOps — all with live data-flow animations.

🧠 Memory Types
💬 CoT · ToT · ReAct
🗺️ Planning Strategies
🔄 Reflection & Reflexion
🔧 Tool Use Patterns
⚡ Inference Optimization
🔀 Model Routing
🕸️ Knowledge Graphs
⚖️ Alignment RLHF·DPO
🎛️ Mixture of Experts
🖼️ Multimodal RAG
🚀 LLMOps Pipeline

How These Concepts Connect

Each concept fills a gap in the complete AI system lifecycle — from how models reason, to how they're deployed and monitored in production.

01 / MEMORY

Memory Types

AI agents need different kinds of memory to operate effectively — from fleeting in-context tokens to persistent external knowledge stores. Choosing the right memory layer is critical for long-horizon tasks.

Sensory / In-Context Memory

The active context window — current conversation, retrieved chunks, tool outputs. Ephemeral: lost when context ends.

128K–1M tokens
~instant access
⚠️ limited size
📝

Working / Buffer Memory

Sliding window of last N messages. Summary buffer compresses old turns to save tokens while preserving context.

N messages
~fast access
📦 compressed
🗄️

Episodic / Semantic Memory

Vector DB stores past conversations and facts as embeddings. Retrieved via similarity search when relevant.

Millions of entries
~5–50ms lookup
🔍 fuzzy search
⚙️

Procedural Memory

Skills baked into model weights via fine-tuning. How to code, reason, format — implicit, always available.

Model weights
~0ms (implicit)
🔒 immutable at runtime
MEMORY FLOW IN AN AGENT SESSION
💬 User Message
🔧 Tool Result
📄 Retrieved Doc
⚡ Context Window
Active Memory
📝 Buffer Summary
Compressed History
🗄️ Vector Store
Long-Term Episodic
🕸️ Knowledge Graph
Structured Facts
⚙️ Model Weights
Procedural Skills
🗜️
Summarization Memory
Progressively summarize old turns with an LLM call. Keeps token count bounded while preserving key facts from long conversations.
ConversationSummaryBufferMemory(max_token_limit=2000)
🏷️
Entity Memory
Extract named entities from dialogue and maintain a persistent entity store. Agent always knows "who is Sarah" across sessions.
entities = {{"Sarah": "CEO of Acme, prefers brevity"}}
🔑
Knowledge Graph Memory
Store facts as (subject, relation, object) triples. Enables structured recall: "what did user prefer last month?"
(user) -[prefers]-> (dark mode UI)
02 / PROMPTING

Prompting Techniques

How you prompt an LLM dramatically changes the quality of its reasoning. These techniques elicit structured thinking, multi-path exploration, and grounded action from the same base model.

CHAIN-OF-THOUGHT — Step-by-step reasoning before answering
USERA store sells 3 types of fruit. Apples cost $1.20, Bananas $0.50, Cherries $3.00. I buy 4 apples, 6 bananas, 2 cherries. What's my total?
ASSISTANT Let me work through this step by step:
Step 1: Apples → 4 × $1.20 = $4.80
Step 2: Bananas → 6 × $0.50 = $3.00
Step 3: Cherries → 2 × $3.00 = $6.00
Step 4: Total → $4.80 + $3.00 + $6.00 = $13.80
VARIANTS
Zero-Shot CoT
Append "Let's think step by step" — no examples needed. Works on many models out-of-the-box.
Few-Shot CoT
Provide 2–8 exemplar (question + reasoning + answer) pairs before the target question.
Auto-CoT
Automatically generate diverse exemplars with zero-shot CoT, cluster by topic, select representatives.
TREE OF THOUGHTS — Explore multiple reasoning branches, evaluate, select best
🤔 Problem: Write a creative story opening
Branch A:
Start in media res
⭐ Score: 8.5/10
Selected!
Branch B:
Start with dialogue
Score: 7/10
Pruned
Branch C:
Start with setting
Score: 5/10
Abandoned
ToT generates N candidate continuations at each step, scores them with an evaluator (LLM or heuristic), and uses BFS/DFS/beam search to explore the best path. Dramatically improves performance on planning and creative tasks.
ReAct — Interleave Reasoning + Acting in structured loop
THOUGHT 1:I need to find the population of Tokyo to answer this question.
ACTION 1:Search["Tokyo population 2024"]
OBSERVE 1:Tokyo metropolitan area population is approximately 37.4 million as of 2024.
THOUGHT 2:Now I have the population. I need to calculate the density using the area (2,194 km²).
ACTION 2:Calculator[37400000 / 2194]
OBSERVE 2:≈ 17,047 people/km²
ANSWER:Tokyo's population density is approximately 17,047 people per km².
FEW-SHOT vs ZERO-SHOT — In-context learning
Zero-Shot
No examples provided. Model relies entirely on pre-training knowledge. Works for simple tasks.
Classify sentiment: "The movie was decent" → ?
Few-Shot (3-shot)
Provide 2–8 (input, output) pairs. Model learns the pattern in-context. No gradient updates needed.
"I loved it" → Positive
"Terrible film" → Negative
"It was okay" → Neutral
"The movie was decent" → ?
Best practices: diverse examples, consistent format, examples that cover edge cases. Order matters — put most relevant last.
SELF-CONSISTENCY — Sample multiple CoT paths, vote on final answer
Path 1 → $13.80 ✓
Temperature=0.7 sample. Calculated each item then summed.
Path 2 → $13.80 ✓
Different reasoning order, same answer. Adds to majority vote.
Path 3 → $14.30 ✗
Arithmetic error in cherries. Minority vote — discarded.
✅ Final Answer: $13.80 (majority vote, 2/3 paths agree)
03 / PLANNING

Agent Planning Strategies

Different planning architectures determine how agents decompose goals, sequence tool calls, and recover from failures. Choosing the right strategy depends on task complexity and latency tolerance.

ReAct
interleaved reasoning+acting
Think → Act → Observe → repeat
No upfront planning — reactive
Simple, fast, works well in practice
Default for most agents
📋
Plan-and-Execute
plan first, then execute
Phase 1: Planner LLM creates full plan
Phase 2: Executor runs each step
Can re-plan if step fails
Complex multi-step tasks
🔀
LLM Compiler
parallel tool execution
Creates DAG of tool dependencies
Independent tools run in parallel
3–5× faster than sequential
Speed-critical pipelines
🌳
MRKL / Router
modular reasoning + knowledge
Routes query to expert module
Calculator, Search, SQL, Code
Each module is specialized
Mixed-domain tasks
PLAN-AND-EXECUTE — DETAILED DATA FLOW
💭
Planner LLM

Generates ordered list of sub-tasks as a structured plan

📋
Task Queue

Ordered list: [search, analyze, write, review]

Executor Agent

Picks next task, selects tool, runs action loop

Task Result

Stored in memory; passed to next task as context

🔄
Re-Planner

If step fails or result changes scope, replans remaining steps

LLM COMPILER — PARALLEL TOOL DAG
📝 User Task
🔍 Web Search
parallel
🗄️ DB Query
parallel
📊 API Call
parallel
⏳ Await all results
✅ Synthesize → Answer
Sequential: 9s → Parallel: 3s (3× speedup)
04 / REFLECTION

Reflection & Self-Correction

Agents can critique their own outputs, identify errors, and iteratively improve — without human feedback. Reflexion stores verbal reflections as episodic memory to avoid repeating mistakes.

SELF-CORRECTION LOOP
📝 Initial Response
🔍 Self-Critique
✏️ Revise
✅ Final Output
[CRITIQUE] The previous response incorrectly claimed that Python is statically typed. Python uses dynamic typing. The answer also missed mentioning type hints added in Python 3.5+.
[REVISION] Python is a dynamically typed language — variable types are determined at runtime. Since Python 3.5, optional type hints allow static type checkers like mypy to verify code without changing runtime behavior.

Actor

Takes actions in the environment based on task, memory, and feedback. Generates trajectory of actions.

  • Uses current policy
  • ReAct-style execution
  • Records full trajectory

Evaluator

Scores the Actor's trajectory against task success criteria. Binary or scalar reward signal.

  • Heuristic or LLM judge
  • Detects task failure
  • Triggers reflection when failing

Self-Reflector

Generates verbal reflection on why the attempt failed and how to improve. Stored in long-term memory.

  • Natural language memory
  • Persists across episodes
  • Injected into next attempt
CONSTITUTIONAL AI — PRINCIPLE-GUIDED SELF-CRITIQUE
1. Generate
Model generates initial response to potentially harmful prompt (red-teaming phase).
2. Critique
Model critiques its own output against constitutional principles: "Does this respect human rights?"
3. Revise
Model rewrites response to comply with principles. Cycle repeats until all principles satisfied.
05 / TOOL USE

Tool Use Patterns

Modern LLMs invoke external tools via structured function calls — extending their capabilities from pure text generation to real-world actions: web search, code execution, database queries, and computer control.

TOOL DEFINITION SCHEMA (OpenAI / Anthropic format)
{
  "name": "web_search",
  "description": "Search the web for current information",
  "parameters": {
    "query": {"type": "string", "description": "search terms"},
    "max_results": {"type": "integer", "default": 5},
    "safe_search": {"type": "boolean", "default": true}
  },
  "required": ["query"]
}
FUNCTION CALLING — FULL CONVERSATION FLOW
USER →"What's the current stock price of NVIDIA?"
LLM decides →I need current data. I'll call the stock_price tool.
TOOL CALL →stock_price(symbol="NVDA", currency="USD")
TOOL RESULT →{{"symbol":"NVDA","price":875.40,"change":"+2.3%","timestamp":"2024-11-15T14:32:00Z"}}
ANSWER →NVIDIA (NVDA) is currently trading at $875.40, up 2.3% today as of 2:32 PM UTC.
🌐
Web Search
Real-time web search. Grounds LLM in current events. Often the highest-value tool.
Tavily · Bing · SerpAPI
💻
Code Execution
Run Python/JS in sandboxed environment. Enables math, data analysis, file manipulation.
E2B · Docker sandbox
🗃️
Database Query
Convert NL to SQL, execute against real DB, return structured results.
Text-to-SQL · BIRD bench
🖥️
Computer Use
Control GUIs, click buttons, fill forms, navigate browsers. Anthropic Claude computer use.
Claude CUA · Playwright
📧
API Integrations
Call REST APIs for email, calendar, CRM, payment, communication services.
MCP · OpenAPI spec
📁
File Operations
Read, write, parse files — PDFs, CSVs, images. Foundation of document workflows.
LlamaIndex · Unstructured
06 / INFERENCE OPT

LLM Inference Optimization

Serving large models at scale requires aggressive optimization. These techniques reduce latency and cost without sacrificing quality — essential for production deployments at thousands of QPS.

🗜️
Quantization
Reduce weight precision: FP32 → FP16 → INT8 → INT4. Smaller model fits in less VRAM, runs faster.
Memory savings:INT4 = 8×
KV Cache
Cache key/value attention states from prompt tokens. Prefill computed once; decode reuses cache. Massive speedup for long contexts.
Decode speedup:10–100×
🔮
Speculative Decoding
Small draft model generates k tokens in parallel; large model verifies in one forward pass. Accepted tokens = free speedup.
Throughput boost:2–4×
📦
Continuous Batching
Dynamically add new requests mid-batch as others finish. Eliminates idle GPU time from fixed-batch padding waste.
GPU utilization:+40–60%
📄
PagedAttention
vLLM's virtual memory for KV cache. Eliminates fragmentation, enables 24× more concurrent requests vs HuggingFace.
Concurrent reqs:24× more
💾
Prompt Caching
Cache repeated system prompts across requests (Anthropic, OpenAI). Eliminates redundant prefill computation.
Cost reduction:up to 90%
QUANTIZATION COMPARISON — 70B MODEL
Format
VRAM
Quality
Speed
FP32 (full)
280 GB
FP16 / BF16
140 GB
1.8×
INT8 (GPTQ)
70 GB
2.5×
INT4 (QLoRA)
35 GB
3.8×
07 / ROUTING

Model Routing & Cascading

Not every query needs GPT-4. A smart router classifies query complexity and directs it to the cheapest model that can handle it — slashing costs by 60–90% with minimal quality loss.

💬 Incoming Query
🔀 Query Classifier / Router
classifies: complexity · topic · risk · cost budget
↓ routes to →
Simple Query

FAQ, classification, summarization, extraction

Haiku · Gemini Flash
~$0.001 / 1K tokens
Medium Query

Multi-step reasoning, code generation, analysis

Sonnet · GPT-4o-mini
~$0.003 / 1K tokens
Complex Query

Deep research, long-form, specialized expertise

Opus · GPT-4o
~$0.015 / 1K tokens
Cached Query

Exact match or semantic duplicate of past query

Cache Hit → Return
~$0.000 (free!)
COST BREAKDOWN — BEFORE vs AFTER ROUTING
Before (all GPT-4o)
$100/day
After (smart routing)
$22/day
With caching too
$8/day
08 / KNOWLEDGE

Knowledge Graphs

Structured knowledge stored as (entity, relation, entity) triples. Enables multi-hop reasoning across connections that vector search cannot traverse. Graph RAG combines both for maximum coverage.

CEO_of founded works_at acquires located_in
👤
🏢
🚀
👤
🔷
📍

🕸️ Knowledge Graph

  • Explicit, typed relationships between entities
  • Multi-hop traversal: "Who invested in OpenAI's parent company?"
  • Precise, no hallucination on stored facts
  • Hard to build — requires entity extraction + linking
  • Best for: compliance, legal, biomedical, enterprise knowledge

🗄️ Vector Store

  • Implicit semantic similarity — no typed relations
  • Single-hop: "documents similar to this query"
  • Works on unstructured text out of the box
  • Easy to build — just embed and index
  • Best for: open-domain Q&A, document search, general RAG
GRAPH RAG — MICROSOFT RESEARCH APPROACH
📄 Documents
🔍 Entity Extract
🕸️ Build Graph
📊 Community Detect
✅ Global Answer
Graph RAG detects communities of related entities, summarizes each community, then uses both local (vector) and global (graph community) context. Outperforms vanilla RAG on queries requiring understanding across entire corpora.
09 / ALIGNMENT

Alignment Techniques

Raw pre-trained models follow statistical patterns, not human values. Alignment techniques teach models to be helpful, harmless, and honest — transforming a text predictor into a trustworthy assistant.

🏆 RLHF

Reinforcement Learning from Human Feedback. Humans rank model outputs → train reward model → optimize policy with PPO.

✓ Gold standard quality · battle-tested (GPT-4, Claude)
✗ Expensive human labeling · unstable PPO training

⚡ DPO

Direct Preference Optimization. Skips reward model entirely — fine-tunes directly on (preferred, rejected) pairs.

✓ 3× simpler · stable · no RL needed · cheaper
✗ Slightly lower peak quality · needs good preference data

🤖 RLAIF

RL from AI Feedback. Replace human raters with a strong LLM judge (Claude Opus). Scales annotation cheaply.

✓ Scalable · consistent · cheap · fast iteration
✗ AI biases propagate · needs strong judge model

📋 SFT

Supervised Fine-Tuning on instruction-following demonstrations. First step in alignment pipeline — teaches format.

✓ Simple · fast · prerequisite for RLHF
✗ Mimics behavior without understanding preferences
RLHF PIPELINE — COMPLETE FLOW
📚 Pre-trainNext-token predict on internet text
✍️ SFTFine-tune on human demonstrations
👥 Human RankHumans rank model outputs A > B > C
🏆 Reward ModelTrained to predict human preference
🎮 PPOPolicy optimized against reward model
✅ Aligned LLMHelpful · Harmless · Honest
10 / ARCHITECTURE

Mixture of Experts (MoE)

Instead of activating all parameters for every token, MoE routes each token to a small subset of "expert" FFN layers. Same quality as a dense model at a fraction of the compute cost per token.

📥 Input Token: "Translate to French:"
🔀 Gating Network
Softmax over expert scores → select Top-K (e.g. K=2)
↓ routes to 2 of 8 experts
🇫🇷Language
Expert
🧮Math
Expert
📝Writing
Expert
💻Code
Expert
🔬Science
Expert
📊Data
Expert
⚖️Legal
Expert
🏥Medical
Expert
↓ weighted sum of expert outputs
✅ "Traduire en français:" (output token)
8×7B
Mixtral: 8 experts, 7B each
Top-2
Experts activated per token
12.5B
Active params (of 46.7B total)
3–5×
Compute saving vs dense
Load Balancing Loss
Auxiliary loss term penalizes routing collapse (all tokens going to 1 expert). Ensures uniform expert utilization.
L_balance = α · Σ f_i · P_i
Expert Parallelism
Different experts sharded across GPUs. Each GPU hosts a subset. All-to-all communication routes tokens to the right GPU.
expert_i → GPU_i → all-to-all scatter
11 / MULTIMODAL

Multimodal RAG

Extend RAG beyond text to retrieve and reason over images, charts, tables, audio, and video. Vision-language models can now index and query visual documents without OCR.

📄
Text
Standard embedding + vector search
OpenAI · Cohere
🖼️
Images
CLIP embeddings · ColPali visual patches
CLIP · ColPali · LLaVA
📊
Tables/Charts
Screenshot → VLM describes → embed description
GPT-4V · Gemini
🎵
Audio
Whisper transcription → text RAG pipeline
Whisper · Deepgram
🎬
Video
Frame sampling + audio → multimodal index
Gemini 1.5 · LLaVA-Video
📐
PDFs/Slides
ColPali: index PDF pages as images directly
ColPali · Unstructured
COLPALI — VISUAL DOCUMENT RETRIEVAL (No OCR!)
📄 PDF Page
(as image)
🔲 Patch
Embeddings
32×32 patches
🗄️ Index
in Vector DB
per-patch vectors
💬 Query
Embedding
text → patches
🏆 MaxSim
Score
best patch match
✅ Top-K
Pages
→ VLM
12 / PRODUCTION

LLMOps Pipeline

Taking LLMs from prototype to reliable production requires systematic prompt versioning, regression testing, shadow deployments, A/B experiments, and continuous feedback loops — just like traditional MLOps.

CI/CD FOR LLM APPLICATIONS — FULL LIFECYCLE
✍️
Prompt Dev
Version control prompts in git. A/B template variants. Prompt registry.
🧪
Eval Suite
Run against golden dataset. RAGAS + LLM-judge. Regression gate.
👥
Human Review
Expert annotators flag edge cases. Capture hard examples for suite.
🕵️
Shadow Deploy
New prompt runs in parallel. Compare vs production without serving to users.
🔀
A/B Test
5% traffic to new variant. Monitor quality, latency, cost. Statistical significance.
🚀
Full Rollout
Canary → 10% → 50% → 100%. Auto-rollback on metric degradation.
📊
Monitor
Track drift, quality, cost, latency. User feedback loop back to eval suite.
Prompt Versioning

Treat prompts like code. Store in git with semantic versioning.

  • PromptLayer · LangSmith
  • Diff comparison between versions
  • Rollback in seconds
Eval-Driven Dev

Never ship a prompt change without running the eval suite first.

  • Golden Q&A dataset (100+ examples)
  • Automated scoring with RAGAS
  • Block merge if score drops >2%
Cost Tracking

Per-user, per-feature, per-model cost attribution.

  • Token usage by model tier
  • Cache hit rate tracking
  • Cost per successful completion
Data Flywheel

Production traffic becomes tomorrow's training data.

  • Thumbs up/down captures preference
  • Hard examples → eval suite
  • Periodic fine-tune on collected data
🔭
LangSmith
Tracing, evals, datasets, prompt hub for LangChain apps.
📡
Helicone
LLM observability: cost, latency, caching proxy — one line of code.
🧠
Weights & Biases
Experiment tracking for fine-tuning runs. Prompt management.
🔬
Arize Phoenix
Open-source LLM observability. Spans, evals, drift detection.

Complete Concept Map

All 12 extended concepts and where they fit in the AI system stack

Concept
Layer
Core Mechanism
Key Benefit
🧠 Memory Types
State Management
In-context + vector + KG + weights
Long-horizon continuity
💬 CoT / ToT / ReAct
Reasoning
Structured thought traces + branching
+20–40% accuracy on hard tasks
🗺️ Planning Strategies
Agent Control
Sequential / parallel / DAG execution
3–5× task speed, recovery from failure
🔄 Reflection
Self-Improvement
Critique → revise loop
Higher quality without extra data
🔧 Tool Use
Capability Extension
JSON function call → external action
Real-world grounding, current data
⚡ Inference Opt
Infrastructure
Quant · KV cache · speculative decode
3–24× speed, 8× memory reduction
🔀 Model Routing
Cost Optimization
Classifier routes to cheapest capable model
60–90% cost reduction
🕸️ Knowledge Graphs
Structured Knowledge
(entity, relation, entity) triples + traversal
Multi-hop reasoning, no hallucination
⚖️ Alignment
Safety & Values
RLHF · DPO · Constitutional AI
Helpful, harmless, honest behavior
🎛️ Mixture of Experts
Model Architecture
Sparse routing to Top-K FFN experts
3–5× compute efficiency
🖼️ Multimodal RAG
Data Modalities
CLIP/ColPali for visual retrieval
Index PDFs/images without OCR
🚀 LLMOps
Production
CI/CD · evals · shadow deploy · flywheel
Reliable, improving production systems