AI Concepts — Extended Visual Guide

01 / MEMORY

Memory Types

AI agents need different kinds of memory to operate effectively — from fleeting in-context tokens to persistent external knowledge stores. Choosing the right memory layer is critical for long-horizon tasks.

⚡

Sensory / In-Context Memory

The active context window — current conversation, retrieved chunks, tool outputs. Ephemeral: lost when context ends.

128K–1M tokens
~instant access
⚠️ limited size

📝

Working / Buffer Memory

Sliding window of last N messages. Summary buffer compresses old turns to save tokens while preserving context.

N messages
~fast access
📦 compressed

🗄️

Episodic / Semantic Memory

Vector DB stores past conversations and facts as embeddings. Retrieved via similarity search when relevant.

Millions of entries
~5–50ms lookup
🔍 fuzzy search
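
The similarity lookup behind this memory layer can be sketched in a few lines. This is a minimal illustration, assuming hand-made 3-dimensional vectors in place of real embeddings; `memory_store` and `retrieve` are hypothetical names, and a production system would use an embedding model plus an approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical toy "embeddings"; real ones come from an embedding model.
memory_store = [
    ("User prefers dark mode", [0.9, 0.1, 0.0]),
    ("User's dog is named Rex", [0.0, 0.8, 0.6]),
    ("User works at Acme", [0.1, 0.2, 0.9]),
]

def retrieve(query_vec, k=1):
    """Return the k stored memories most similar to the query vector."""
    ranked = sorted(memory_store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

top = retrieve([1.0, 0.0, 0.1], k=1)
```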

⚙️

Procedural Memory

Skills baked into model weights during pre-training and fine-tuning. How to code, reason, and format: implicit and always available.

Model weights
~0ms (implicit)
🔒 immutable at runtime

MEMORY FLOW IN AN AGENT SESSION

💬 User Message

🔧 Tool Result

📄 Retrieved Doc

→

⚡ Context Window
Active Memory

📝 Buffer Summary
Compressed History

↔

🗄️ Vector Store
Long-Term Episodic

🕸️ Knowledge Graph
Structured Facts

⚙️ Model Weights
Procedural Skills

🗜️

Summarization Memory

Progressively summarize old turns with an LLM call. Keeps token count bounded while preserving key facts from long conversations.

ConversationSummaryBufferMemory(llm=llm, max_token_limit=2000)
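
The idea can be sketched without any framework. This is an illustrative sketch only: `SummaryBuffer`, `count_tokens`, and `summarize` are hypothetical names, tokens are approximated by whitespace-separated words, and the summarizer is a stub standing in for an LLM call.

```python
def count_tokens(text):
    # Crude stand-in for a tokenizer: one token per word.
    return len(text.split())

def summarize(summary, turns):
    # Hypothetical summarizer; a real one would be an LLM call.
    # Here each evicted turn is clipped to its first five words.
    clipped = [" ".join(t.split()[:5]) for t in turns]
    return " | ".join(([summary] if summary else []) + clipped)

class SummaryBuffer:
    def __init__(self, max_token_limit):
        self.max_token_limit = max_token_limit
        self.summary = ""
        self.turns = []

    def add(self, turn):
        self.turns.append(turn)
        evicted = []
        # Evict the oldest turns until the verbatim buffer fits the budget.
        while len(self.turns) > 1 and sum(map(count_tokens, self.turns)) > self.max_token_limit:
            evicted.append(self.turns.pop(0))
        if evicted:
            self.summary = summarize(self.summary, evicted)

    def context(self):
        # What would be placed in the prompt: running summary, then recent turns.
        return ([f"Summary: {self.summary}"] if self.summary else []) + self.turns

buf = SummaryBuffer(max_token_limit=12)
for turn in ["user: my name is Sarah and I run Acme",
             "assistant: nice to meet you Sarah",
             "user: please keep answers brief from now on"]:
    buf.add(turn)
```

After the three turns, only the latest turn is kept verbatim and the older ones survive in compressed form inside `buf.summary`.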

🏷️

Entity Memory

Extract named entities from dialogue and maintain a persistent entity store. Agent always knows "who is Sarah" across sessions.

entities = {"Sarah": "CEO of Acme, prefers brevity"}

🔑

Knowledge Graph Memory

Store facts as (subject, relation, object) triples. Enables structured recall: "what did user prefer last month?"

(user) -[prefers]-> (dark mode UI)

02 / PROMPTING

Prompting Techniques

How you prompt an LLM dramatically changes the quality of its reasoning. These techniques elicit structured thinking, multi-path exploration, and grounded action from the same base model.

CHAIN-OF-THOUGHT — Step-by-step reasoning before answering

USER A store sells 3 types of fruit. Apples cost $1.20, Bananas $0.50, Cherries $3.00. I buy 4 apples, 6 bananas, 2 cherries. What's my total?

ASSISTANT Let me work through this step by step:
Step 1: Apples → 4 × $1.20 = $4.80
Step 2: Bananas → 6 × $0.50 = $3.00
Step 3: Cherries → 2 × $3.00 = $6.00
Step 4: Total → $4.80 + $3.00 + $6.00 = $13.80

VARIANTS

Zero-Shot CoT

Append "Let's think step by step" — no examples needed. Works on many models out-of-the-box.

Few-Shot CoT

Provide 2–8 exemplars (question + reasoning + answer) before the target question.

Auto-CoT

Cluster the questions by similarity, pick a representative from each cluster, and generate its reasoning chain with zero-shot CoT to build diverse exemplars automatically.

TREE OF THOUGHTS — Explore multiple reasoning branches, evaluate, select best

🤔 Problem: Write a creative story opening

Branch A:
Start in medias res

⭐ Score: 8.5/10
Selected!

Branch B:
Start with dialogue

Score: 7/10
Pruned

Branch C:
Start with setting

Score: 5/10
Abandoned

ToT generates N candidate continuations at each step, scores them with an evaluator (LLM or heuristic), and uses BFS, DFS, or beam search to explore the best path. It helps most on tasks that need search or lookahead, such as planning puzzles and constrained creative writing.
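
The generate, score, and prune loop can be sketched on a toy problem. This is a minimal sketch under stated assumptions: the "thoughts" are numbers to add toward a target of 11, `propose` and `evaluate` are hypothetical stand-ins for LLM calls, and the search is a plain beam search.

```python
def propose(state):
    # Candidate "thoughts": extend the partial solution with one more number.
    return [state + (n,) for n in (1, 3, 5)]

def evaluate(state, target=11):
    # Heuristic evaluator (an LLM judge could play this role): closer to target is better.
    return -abs(target - sum(state))

def tree_of_thoughts(depth=3, beam_width=2):
    frontier = [()]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]  # keep the best branches, prune the rest
    return frontier[0]

best = tree_of_thoughts()
```

With a beam width of 2, two partial solutions survive each round, so a branch that looks second-best early on can still lead to the final answer.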

ReAct — Interleave Reasoning + Acting in structured loop

THOUGHT 1: I need to find the population of Tokyo to answer this question.

ACTION 1: Search["Tokyo population 2024"]

OBSERVE 1: Tokyo metropolitan area population is approximately 37.4 million as of 2024.

THOUGHT 2: Now I have the population. I need to calculate the density using the area (2,194 km²).

ACTION 2: Calculator[37400000 / 2194]

OBSERVE 2: ≈ 17,046 people/km²

ANSWER: Tokyo's population density is approximately 17,046 people per km².
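
The loop that drives a trace like this can be sketched as follows. This is an illustration, not a framework API: `scripted_policy` is a hypothetical stand-in for the LLM, and both tools are stubs with hard-coded behavior.

```python
def search(query):
    # Stub search tool; a real one would call a search API.
    return "37400000"

def calculator(expression):
    # Handles only "a / b"; enough for this sketch.
    numerator, denominator = (float(x) for x in expression.split("/"))
    return str(round(numerator / denominator))

TOOLS = {"Search": search, "Calculator": calculator}

def scripted_policy(observations):
    # Stand-in for the LLM: chooses the next step from what has been observed so far.
    if len(observations) == 0:
        return ("act", "Search", "Tokyo population 2024")
    if len(observations) == 1:
        return ("act", "Calculator", f"{observations[0]} / 2194")
    return ("answer", f"about {int(observations[1]):,} people per km²")

def react(policy, max_steps=5):
    observations = []
    for _ in range(max_steps):
        step = policy(observations)                          # Thought
        if step[0] == "answer":
            return step[1]
        _, tool_name, tool_input = step                      # Action
        observations.append(TOOLS[tool_name](tool_input))    # Observation
    raise RuntimeError("step budget exhausted")

answer = react(scripted_policy)
```

The `max_steps` cap is a common safeguard so that a policy which never answers cannot loop forever.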

FEW-SHOT vs ZERO-SHOT — In-context learning

Zero-Shot

No examples provided. Model relies entirely on pre-training knowledge. Works for simple tasks.

Classify sentiment: "The movie was decent" → ?

Few-Shot (3-shot)

Provide 2–8 (input, output) pairs. Model learns the pattern in-context. No gradient updates needed.

"I loved it" → Positive
"Terrible film" → Negative
"It was okay" → Neutral
"The movie was decent" → ?

Best practices: diverse examples, consistent format, examples that cover edge cases. Order matters — put most relevant last.

SELF-CONSISTENCY — Sample multiple CoT paths, vote on final answer

Path 1 → $13.80 ✓

Temperature=0.7 sample. Calculated each item then summed.

Path 2 → $13.80 ✓

Different reasoning order, same answer. Adds to majority vote.

Path 3 → $14.30 ✗

Arithmetic error in cherries. Minority vote — discarded.

✅ Final Answer: $13.80 (majority vote, 2/3 paths agree)
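
The voting step is simple enough to show directly. This is a minimal sketch, assuming the final answers have already been extracted from each sampled reasoning path; `self_consistency` is a hypothetical helper name.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of paths that agree with it."""
    votes = Counter(answers)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(answers)

final, agreement = self_consistency(["$13.80", "$13.80", "$14.30"])
```

The agreement fraction can double as a rough confidence signal: low agreement suggests sampling more paths or escalating the question.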

03 / PLANNING

Agent Planning Strategies

Different planning architectures determine how agents decompose goals, sequence tool calls, and recover from failures. Choosing the right strategy depends on task complexity and latency tolerance.

⚡

ReAct

interleaved reasoning+acting

Think → Act → Observe → repeat

No upfront planning — reactive

Simple, fast, works well in practice

Default for most agents

📋

Plan-and-Execute

plan first, then execute

Phase 1: Planner LLM creates full plan

Phase 2: Executor runs each step

Can re-plan if step fails

Complex multi-step tasks

🔀

LLM Compiler

parallel tool execution

Creates DAG of tool dependencies

Independent tools run in parallel

3–5× faster than sequential

Speed-critical pipelines

🌳

MRKL / Router

modular reasoning + knowledge

Routes query to expert module

Calculator, Search, SQL, Code

Each module is specialized

Mixed-domain tasks

PLAN-AND-EXECUTE — DETAILED DATA FLOW

💭

Planner LLM

Generates ordered list of sub-tasks as a structured plan

📋

Task Queue

Ordered list: [search, analyze, write, review]

⚡

Executor Agent

Picks next task, selects tool, runs action loop

✅

Task Result

Stored in memory; passed to next task as context

🔄

Re-Planner

If step fails or result changes scope, replans remaining steps
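
The five stages above can be sketched as one loop. This is an illustrative sketch: `planner`, `executor`, and `replanner` are hypothetical stubs in place of LLM calls, and the executor fails on purpose once to exercise the re-planning path.

```python
def planner(goal):
    # Hypothetical planner; a real one would be an LLM call returning a task list.
    return ["search", "analyze", "write", "review"]

def replanner(failed_task, remaining):
    # Hypothetical re-planner: swap the failed step for a fallback, keep the rest.
    return [f"{failed_task}_fallback"] + remaining

def executor(task, context):
    # Stub executor: "analyze" always fails so the loop has to re-plan.
    if task == "analyze":
        raise RuntimeError("analysis tool unavailable")
    return f"{task} done (saw {len(context)} prior results)"

def plan_and_execute(goal):
    queue, results = planner(goal), []
    while queue:
        task = queue.pop(0)
        try:
            # Earlier results are passed to each task as context.
            results.append(executor(task, results))
        except RuntimeError:
            queue = replanner(task, queue)
    return results

results = plan_and_execute("write a market report")
```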

LLM COMPILER — PARALLEL TOOL DAG

📝 User Task

🔍 Web Search
parallel

🗄️ DB Query
parallel

📊 API Call
parallel

⏳ Await all results

✅ Synthesize → Answer

Sequential: 9s → Parallel: 3s (3× speedup)
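
The fan-out and join in this diagram map naturally onto `asyncio.gather`. This is a minimal sketch, assuming three independent tools simulated with short sleeps; `call_tool` and `run_dag` are hypothetical names, and real tool calls would be network requests.

```python
import asyncio
import time

async def call_tool(name, latency_s):
    await asyncio.sleep(latency_s)  # stand-in for network latency
    return f"{name} result"

async def run_dag():
    # The three tools do not depend on each other, so they run concurrently.
    results = await asyncio.gather(
        call_tool("web_search", 0.1),
        call_tool("db_query", 0.1),
        call_tool("api_call", 0.1),
    )
    # The synthesis node depends on all three and runs only after they finish.
    return " + ".join(results)

start = time.perf_counter()
answer = asyncio.run(run_dag())
elapsed = time.perf_counter() - start
```

Total wall time is close to the slowest single call rather than the sum of all three, which is where the speedup comes from.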

04 / REFLECTION

Reflection & Self-Correction

Agents can critique their own outputs, identify errors, and iteratively improve — without human feedback. Reflexion stores verbal reflections as episodic memory to avoid repeating mistakes.

SELF-CORRECTION LOOP

📝 Initial Response

→

🔍 Self-Critique

→

✏️ Revise

→

✅ Final Output

[CRITIQUE] The previous response incorrectly claimed that Python is statically typed. Python uses dynamic typing. The answer also missed mentioning type hints added in Python 3.5+.
[REVISION] Python is a dynamically typed language — variable types are determined at runtime. Since Python 3.5, optional type hints allow static type checkers like mypy to verify code without changing runtime behavior.

Actor

Takes actions in the environment based on task, memory, and feedback. Generates trajectory of actions.

Uses current policy
ReAct-style execution
Records full trajectory

Evaluator

Scores the Actor's trajectory against task success criteria. Binary or scalar reward signal.

Heuristic or LLM judge
Detects task failure
Triggers reflection when failing

Self-Reflector

Generates verbal reflection on why the attempt failed and how to improve. Stored in long-term memory.

Natural language memory
Persists across episodes
Injected into next attempt
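
The three roles above can be wired into a retry loop. This is a toy sketch: `actor`, `evaluator`, and `reflect` are hypothetical stubs for LLM calls, and the actor is rigged to succeed only after it has read a reflection.

```python
def actor(task, reflections):
    # Stub actor: succeeds only once a stored reflection tells it to validate input.
    if any("validate" in r for r in reflections):
        return "validated answer"
    return "unchecked answer"

def evaluator(output):
    # Binary success signal; could be a heuristic or an LLM judge.
    return output == "validated answer"

def reflect(output):
    # Stub reflector; a real one would be an LLM explaining the failure.
    return f"Attempt '{output}' failed: validate the input before answering."

def reflexion(task, max_trials=3):
    reflections = []  # verbal memory that persists across attempts
    for trial in range(1, max_trials + 1):
        output = actor(task, reflections)
        if evaluator(output):
            return output, trial, reflections
        reflections.append(reflect(output))
    return None, max_trials, reflections

output, trials, memory = reflexion("parse the form")
```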

CONSTITUTIONAL AI — PRINCIPLE-GUIDED SELF-CRITIQUE

1. Generate

Model generates initial response to potentially harmful prompt (red-teaming phase).

2. Critique

Model critiques its own output against constitutional principles: "Does this respect human rights?"

3. Revise

Model rewrites the response to comply with the principles. The critique-and-revise cycle can repeat with different principles, and the final revisions become fine-tuning data.

05 / TOOL USE

Tool Use Patterns

Modern LLMs invoke external tools via structured function calls — extending their capabilities from pure text generation to real-world actions: web search, code execution, database queries, and computer control.

TOOL DEFINITION SCHEMA (OpenAI-style; Anthropic uses "input_schema" in place of "parameters")

{
  "name": "web_search",
  "description": "Search the web for current information",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "search terms"},
      "max_results": {"type": "integer", "default": 5},
      "safe_search": {"type": "boolean", "default": true}
    },
    "required": ["query"]
  }
}

FUNCTION CALLING — FULL CONVERSATION FLOW

USER → "What's the current stock price of NVIDIA?"

LLM decides → I need current data. I'll call the stock_price tool.

TOOL CALL → stock_price(symbol="NVDA", currency="USD")

TOOL RESULT → {"symbol":"NVDA","price":875.40,"change":"+2.3%","timestamp":"2024-11-15T14:32:00Z"}

ANSWER → NVIDIA (NVDA) is currently trading at $875.40, up 2.3% today as of 2:32 PM UTC.
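
On the application side, the tool call has to be parsed and dispatched to real code. This is a minimal sketch under stated assumptions: the JSON shape of the model's tool call (`name` plus `arguments`) is illustrative because providers differ in field names, and `stock_price` is a stub returning a canned quote.

```python
import json

def stock_price(symbol, currency="USD"):
    # Stub tool; a real one would call a market-data API.
    return {"symbol": symbol, "price": 875.40, "change": "+2.3%", "currency": currency}

TOOL_REGISTRY = {"stock_price": stock_price}

def dispatch(tool_call_json):
    """Parse a tool call, run the matching function, and return a JSON result string."""
    call = json.loads(tool_call_json)
    func = TOOL_REGISTRY.get(call["name"])
    if func is None:
        # Return the error to the model instead of raising, so it can recover.
        return json.dumps({"error": f"unknown tool {call['name']}"})
    return json.dumps(func(**call["arguments"]))

# Assumed shape of the model's output for this sketch.
model_output = '{"name": "stock_price", "arguments": {"symbol": "NVDA", "currency": "USD"}}'
tool_result = json.loads(dispatch(model_output))
```

The serialized result is then appended to the conversation so the model can write the final answer.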

🌐

Web Search

Real-time web search. Grounds LLM in current events. Often the highest-value tool.

Tavily · Bing · SerpAPI

💻

Code Execution

Run Python/JS in sandboxed environment. Enables math, data analysis, file manipulation.

E2B · Docker sandbox

🗃️

Database Query

Convert NL to SQL, execute against real DB, return structured results.

Text-to-SQL · BIRD bench

🖥️

Computer Use

Control GUIs, click buttons, fill forms, navigate browsers. Example: Anthropic's Claude computer use.

Claude computer use · Playwright

📧

API Integrations

Call REST APIs for email, calendar, CRM, payment, communication services.

MCP · OpenAPI spec

📁

File Operations

Read, write, parse files — PDFs, CSVs, images. Foundation of document workflows.

LlamaIndex · Unstructured

06 / INFERENCE OPT

LLM Inference Optimization

Serving large models at scale requires aggressive optimization. These techniques reduce latency and cost with little or no loss in quality, which is essential for production deployments at thousands of QPS.

🗜️

Quantization

Reduce weight precision: FP32 → FP16 → INT8 → INT4. Smaller model fits in less VRAM, runs faster.

Memory savings: INT4 = 8× vs FP32

⚡

KV Cache

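Cache the key/value attention states of every token already processed. Prefill computes them once for the prompt; each decode step reuses the cache and computes keys/values only for the new token. The saving grows with context length.

Decode speedup: 10–100×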

🔮

Speculative Decoding

Small draft model proposes k tokens autoregressively; the large model verifies all k in a single parallel forward pass. Accepted tokens cost no extra large-model steps.

Throughput boost: 2–4×
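
The propose-then-verify cycle can be sketched with two toy "models". This is a simplified illustration: both models are lookup stubs over fixed sentences, verification uses greedy token matching rather than the probabilistic acceptance rule used in practice, and all names are hypothetical.

```python
TARGET_TEXT = "the cat sat on the mat".split()
DRAFT_TEXT = "the cat sat in the mat".split()  # draft disagrees at position 3

def target_next(prefix):
    # Stand-in for the large model's greedy next token.
    return TARGET_TEXT[len(prefix)] if len(prefix) < len(TARGET_TEXT) else None

def draft_next(prefix):
    # Stand-in for the small draft model.
    return DRAFT_TEXT[len(prefix)] if len(prefix) < len(DRAFT_TEXT) else None

def speculative_step(prefix, k=4):
    # 1. Draft proposes up to k tokens, one after another.
    proposal = []
    while len(proposal) < k:
        tok = draft_next(prefix + proposal)
        if tok is None:
            break
        proposal.append(tok)
    # 2. Target checks every proposed position (one batched forward pass in a real system).
    accepted = []
    for tok in proposal:
        if tok != target_next(prefix + accepted):
            break
        accepted.append(tok)
    # 3. Target always contributes one token of its own: a correction or a bonus token.
    correction = target_next(prefix + accepted)
    if correction is not None:
        accepted.append(correction)
    return accepted

def generate():
    out, target_passes = [], 0
    while len(out) < len(TARGET_TEXT):
        out += speculative_step(out)
        target_passes += 1
    return out, target_passes

tokens, passes = generate()
```

Here the six-token sentence is produced in two verification passes instead of six sequential large-model steps, and the output matches what the target model alone would have generated.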

📦

Continuous Batching

Dynamically add new requests mid-batch as others finish. Eliminates idle GPU time from fixed-batch padding waste.

GPU utilization: +40–60%

📄

PagedAttention

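vLLM's virtual-memory-style paging for the KV cache. Stores the cache in fixed-size blocks to nearly eliminate fragmentation, so more requests fit in a batch; vLLM reports up to 24× higher throughput than HuggingFace Transformers.

Throughput: up to 24×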

💾

Prompt Caching

Cache repeated system prompts across requests (Anthropic, OpenAI). Eliminates redundant prefill computation.

Cost reduction: up to 90%

QUANTIZATION COMPARISON — 70B MODEL

Format | VRAM (weights only) | Speed (relative to FP32)
FP32 (full) | 280 GB | 1×
FP16 / BF16 | 140 GB | 1.8×
INT8 (GPTQ) | 70 GB | 2.5×
INT4 (QLoRA) | 35 GB | 3.8×
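
The VRAM figures follow directly from bytes per parameter. A short sketch of the arithmetic, counting weights only (KV cache, activations, and quantization scale factors are ignored); `weight_memory_gb` is a hypothetical helper name.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, fmt):
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

sizes = {fmt: weight_memory_gb(70e9, fmt) for fmt in BYTES_PER_PARAM}
```

For a 70B model this gives 280, 140, 70, and 35 GB, so INT4 is an 8× reduction relative to FP32.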

07 / ROUTING

Model Routing & Cascading

Not every query needs GPT-4. A smart router classifies query complexity and directs it to the cheapest model that can handle it — slashing costs by 60–90% with minimal quality loss.

💬 Incoming Query

↓

🔀 Query Classifier / Router
classifies: complexity · topic · risk · cost budget

↓ routes to →

Simple Query

FAQ, classification, summarization, extraction

Haiku · Gemini Flash

~$0.001 / 1K tokens

Medium Query

Multi-step reasoning, code generation, analysis

Sonnet · GPT-4o-mini

~$0.003 / 1K tokens

Complex Query

Deep research, long-form, specialized expertise

Opus · GPT-4o

~$0.015 / 1K tokens

Cached Query

Exact match or semantic duplicate of past query

Cache Hit → Return

~$0.000 (free!)

COST BREAKDOWN — BEFORE vs AFTER ROUTING

Before (all GPT-4o)

$100/day

After (smart routing)

$22/day

With caching too

$8/day
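
A router with a cache in front can be sketched as follows. This is an illustrative sketch only: the keyword heuristic in `classify` is a hypothetical placeholder for a trained classifier, the cache is exact-match on normalized text, and the prices are the per-1K-token figures from the tiers above.

```python
PRICE_PER_1K = {"small": 0.001, "medium": 0.003, "large": 0.015}

def classify(query):
    # Hypothetical heuristic; production routers are usually trained classifiers.
    words = len(query.split())
    if "research" in query.lower() or words > 40:
        return "large"
    if "code" in query.lower() or words > 15:
        return "medium"
    return "small"

cache = {}

def route(query, tokens=1000):
    """Return (model tier, estimated cost in dollars) for a query."""
    key = query.strip().lower()
    if key in cache:  # exact-match cache hit costs nothing
        return cache[key], 0.0
    tier = classify(query)
    cache[key] = tier
    return tier, PRICE_PER_1K[tier] * tokens / 1000

first = route("Summarize this FAQ entry")
second = route("Write code to parse a CSV file")
repeat = route("summarize this FAQ entry")
```

A semantic cache would replace the dictionary lookup with an embedding similarity check, at the cost of occasional false hits.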

08 / KNOWLEDGE

Knowledge Graphs

Structured knowledge stored as (entity, relation, entity) triples. Enables multi-hop reasoning across connections that vector search cannot traverse. Graph RAG combines both for maximum coverage.

🕸️ Knowledge Graph

Explicit, typed relationships between entities
Multi-hop traversal: "Who invested in OpenAI's parent company?"
Precise, no hallucination on stored facts
Hard to build — requires entity extraction + linking
Best for: compliance, legal, biomedical, enterprise knowledge

🗄️ Vector Store

Implicit semantic similarity — no typed relations
Single-hop: "documents similar to this query"
Works on unstructured text out of the box
Easy to build — just embed and index
Best for: open-domain Q&A, document search, general RAG
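
Multi-hop traversal over triples can be shown with a tiny in-memory store. This is a minimal sketch with made-up entities; `TRIPLES`, `objects`, `subjects`, and `investors_in_parent` are hypothetical names, and a real deployment would use a graph database and its query language.

```python
TRIPLES = [
    ("Alice", "works_at", "Acme"),
    ("Acme", "subsidiary_of", "Globex"),
    ("Initech", "invested_in", "Globex"),
    ("Hooli", "invested_in", "Globex"),
]

def objects(subject, relation):
    """Follow an edge forward: subject -[relation]-> ?"""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

def subjects(relation, obj):
    """Follow an edge backward: ? -[relation]-> obj"""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]

def investors_in_parent(company):
    # Hop 1: company -> parent. Hop 2: parent <- investors.
    return sorted(
        investor
        for parent in objects(company, "subsidiary_of")
        for investor in subjects("invested_in", parent)
    )

result = investors_in_parent("Acme")
```

No single stored fact mentions both Acme and its parent's investors, which is why a similarity search over individual chunks tends to miss this kind of question.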

GRAPH RAG — MICROSOFT RESEARCH APPROACH

📄 Documents

→

🔍 Entity Extract

→

🕸️ Build Graph

→

📊 Community Detect

→

✅ Global Answer

Graph RAG detects communities of related entities, summarizes each community, then uses both local (vector) and global (graph community) context. Outperforms vanilla RAG on queries requiring understanding across entire corpora.

09 / ALIGNMENT

Alignment Techniques

Raw pre-trained models follow statistical patterns, not human values. Alignment techniques teach models to be helpful, harmless, and honest — transforming a text predictor into a trustworthy assistant.

🏆 RLHF

Reinforcement Learning from Human Feedback. Humans rank model outputs → train reward model → optimize policy with PPO.

✓ Gold standard quality · battle-tested (GPT-4, Claude)

✗ Expensive human labeling · unstable PPO training

⚡ DPO

Direct Preference Optimization. Skips reward model entirely — fine-tunes directly on (preferred, rejected) pairs.

✓ Simpler pipeline · stable · no RL needed · cheaper

✗ Slightly lower peak quality · needs good preference data
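
The DPO objective for a single preference pair fits in a few lines. This is a sketch of the standard per-pair loss, assuming the summed log-probabilities of each response under the policy and the frozen reference model are already available; the numeric inputs are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how much more the policy likes each response than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy identical to reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)    # policy now favors the chosen response
```

The loss starts at log 2 when the policy equals the reference and falls as the policy shifts probability toward the preferred response.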

🤖 RLAIF

RL from AI Feedback. Replace human raters with a strong LLM judge (e.g. Claude Opus). Scales annotation cheaply.

✓ Scalable · consistent · cheap · fast iteration

✗ AI biases propagate · needs strong judge model

📋 SFT

Supervised Fine-Tuning on instruction-following demonstrations. First step in alignment pipeline — teaches format.

✓ Simple · fast · prerequisite for RLHF

✗ Mimics behavior without understanding preferences

RLHF PIPELINE — COMPLETE FLOW

📚 Pre-train
Next-token prediction on internet text

→

✍️ SFT
Fine-tune on human demonstrations

→

👥 Human Rank
Humans rank model outputs A > B > C

→

🏆 Reward Model
Trained to predict human preference

→

🎮 PPO
Policy optimized against reward model

→

✅ Aligned LLM
Helpful · Harmless · Honest

10 / ARCHITECTURE

Mixture of Experts (MoE)

Instead of activating all parameters for every token, MoE routes each token to a small subset of "expert" FFN layers. The goal is quality approaching that of a dense model with the same total parameter count, at a fraction of the compute cost per token.

📥 Input Token: "Translate to French:"

↓

🔀 Gating Network
Softmax over expert scores → select Top-K (e.g. K=2)

↓ routes to 2 of 8 experts

🇫🇷 Language Expert

🧮 Math Expert

📝 Writing Expert

💻 Code Expert

🔬 Science Expert

📊 Data Expert

⚖️ Legal Expert

🏥 Medical Expert

↓ weighted sum of expert outputs

✅ "Traduire en français:" (output token)

8×7B

Mixtral 8x7B: 8 FFN experts per layer, attention shared

Top-2

Experts activated per token

12.9B

Active params (of 46.7B total)

3–5×

Compute saving vs dense

Load Balancing Loss

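Auxiliary loss term penalizes routing collapse (all tokens going to one expert) and encourages even expert utilization. Here N is the number of experts, f_i is the fraction of tokens routed to expert i, and P_i is the mean router probability assigned to expert i.

L_balance = α · N · Σ_i f_i · P_i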
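
Top-K gating and the balancing loss can be sketched in plain Python. This is a toy illustration with scalar "experts"; the factor of N (number of experts) in the loss follows the Switch Transformer convention, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}, probs

def moe_layer(x, logits, experts, k=2):
    # Only the selected experts run; their outputs are combined by gate weight.
    weights, _ = top_k_gate(logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

def load_balance_loss(batch_logits, alpha=0.01):
    n, tokens = len(batch_logits[0]), len(batch_logits)
    f, p = [0.0] * n, [0.0] * n
    for logits in batch_logits:
        weights, probs = top_k_gate(logits, k=1)
        f[next(iter(weights))] += 1 / tokens  # fraction of tokens sent to each expert
        for i, prob in enumerate(probs):
            p[i] += prob / tokens             # mean router probability per expert
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

experts = [lambda x, scale=s: scale * x for s in (1.0, 2.0, 3.0, 4.0)]
output = moe_layer(1.0, [0.1, 2.0, 0.3, 2.0], experts)
balanced = load_balance_loss([[5.0, 0.0], [0.0, 5.0]])
collapsed = load_balance_loss([[5.0, 0.0], [5.0, 0.0]])
```

The collapsed batch, where every token picks the same expert, yields roughly twice the loss of the balanced batch, which is the gradient signal that pushes the router toward even utilization.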

Expert Parallelism

Different experts sharded across GPUs. Each GPU hosts a subset. All-to-all communication routes tokens to the right GPU.

expert_i → GPU_i → all-to-all scatter

11 / MULTIMODAL

Multimodal RAG

Extend RAG beyond text to retrieve and reason over images, charts, tables, audio, and video. Vision-language models can now index and query visual documents without OCR.

📄

Text

Standard embedding + vector search

OpenAI · Cohere

🖼️

Images

CLIP embeddings · ColPali visual patches

CLIP · ColPali · LLaVA

📊

Tables/Charts

Screenshot → VLM describes → embed description

GPT-4V · Gemini

🎵

Audio

Whisper transcription → text RAG pipeline

Whisper · Deepgram

🎬

Video

Frame sampling + audio → multimodal index

Gemini 1.5 · LLaVA-Video

📐

PDFs/Slides

ColPali: index PDF pages as images directly

ColPali · Unstructured

COLPALI — VISUAL DOCUMENT RETRIEVAL (No OCR!)

📄 PDF Page
(as image)

→

🔲 Patch
Embeddings
32×32 patches

→

🗄️ Index
in Vector DB
per-patch vectors

→

💬 Query
Embedding
text → token vectors

→

🏆 MaxSim
Score
best patch match

→

✅ Top-K
Pages
→ VLM
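
The MaxSim late-interaction score at the heart of this pipeline can be sketched as follows. This is a minimal sketch using hand-made 2-dimensional vectors; real query-token and patch embeddings are higher-dimensional, and `maxsim` is a hypothetical helper name.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, page_vecs):
    # For each query token, take its best-matching patch, then sum over tokens.
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

# Hypothetical toy vectors: two query tokens, and per-patch vectors for two pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_1": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "page_2": [[0.3, 0.3], [0.4, 0.1]],
}
ranking = sorted(pages, key=lambda name: maxsim(query, pages[name]), reverse=True)
```

Because each query token is matched independently, a page scores well when different regions of it answer different parts of the query.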

12 / PRODUCTION

LLMOps Pipeline

Taking LLMs from prototype to reliable production requires systematic prompt versioning, regression testing, shadow deployments, A/B experiments, and continuous feedback loops — just like traditional MLOps.

CI/CD FOR LLM APPLICATIONS — FULL LIFECYCLE

✍️

Prompt Dev

Version control prompts in git. A/B template variants. Prompt registry.

🧪

Eval Suite

Run against golden dataset. RAGAS + LLM-judge. Regression gate.

👥

Human Review

Expert annotators flag edge cases. Capture hard examples for suite.

🕵️

Shadow Deploy

New prompt runs in parallel. Compare vs production without serving to users.

🔀

A/B Test

5% traffic to new variant. Monitor quality, latency, cost. Statistical significance.

🚀

Full Rollout

Canary → 10% → 50% → 100%. Auto-rollback on metric degradation.

📊

Monitor

Track drift, quality, cost, latency. User feedback loop back to eval suite.

Prompt Versioning

Treat prompts like code. Store in git with semantic versioning.

PromptLayer · LangSmith
Diff comparison between versions
Rollback in seconds

Eval-Driven Dev

Never ship a prompt change without running the eval suite first.

Golden Q&A dataset (100+ examples)
Automated scoring with RAGAS
Block merge if score drops >2%
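
A regression gate of this kind can be a few lines in CI. This is a sketch under stated assumptions: scores are per-example values in [0, 1] from some scorer, the 2% threshold is interpreted as a relative drop against the baseline, and `regression_gate` is a hypothetical name.

```python
def mean_score(results):
    return sum(results) / len(results)

def regression_gate(baseline_score, candidate_score, max_drop=0.02):
    """Return True if the candidate may merge, False if it regresses too far."""
    # Relative drop versus baseline; treating the 2% as relative is an assumption.
    drop = (baseline_score - candidate_score) / baseline_score
    return drop <= max_drop

baseline = mean_score([1, 1, 0, 1, 1])   # current production prompt
candidate = mean_score([1, 1, 0, 1, 0])  # proposed prompt change
allowed = regression_gate(baseline, candidate)
```

In a pipeline, a False result would fail the check and block the merge until the regression is explained or fixed.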

Cost Tracking

Per-user, per-feature, per-model cost attribution.

Token usage by model tier
Cache hit rate tracking
Cost per successful completion

Data Flywheel

Production traffic becomes tomorrow's training data.

Thumbs up/down captures preference
Hard examples → eval suite
Periodic fine-tune on collected data

🔭

LangSmith

Tracing, evals, datasets, prompt hub for LangChain apps.

📡

Helicone

LLM observability: cost, latency, caching proxy — one line of code.

🧠

Weights & Biases

Experiment tracking for fine-tuning runs. Prompt management.

🔬

Arize Phoenix

Open-source LLM observability. Spans, evals, drift detection.

Complete Concept Map

All 12 extended concepts and where they fit in the AI system stack

Concept | Layer | Core Mechanism | Key Benefit
🧠 Memory Types | State Management | In-context + vector + KG + weights | Long-horizon continuity
💬 CoT / ToT / ReAct | Reasoning | Structured thought traces + branching | +20–40% accuracy on hard tasks
🗺️ Planning Strategies | Agent Control | Sequential / parallel / DAG execution | 3–5× task speed, recovery from failure
🔄 Reflection | Self-Improvement | Critique → revise loop | Higher quality without extra data
🔧 Tool Use | Capability Extension | JSON function call → external action | Real-world grounding, current data
⚡ Inference Opt | Infrastructure | Quant · KV cache · speculative decode | 3–24× speed, 8× memory reduction
🔀 Model Routing | Cost Optimization | Classifier routes to cheapest capable model | 60–90% cost reduction
🕸️ Knowledge Graphs | Structured Knowledge | (entity, relation, entity) triples + traversal | Multi-hop reasoning, no hallucination
⚖️ Alignment | Safety & Values | RLHF · DPO · Constitutional AI | Helpful, harmless, honest behavior
🎛️ Mixture of Experts | Model Architecture | Sparse routing to Top-K FFN experts | 3–5× compute efficiency
🖼️ Multimodal RAG | Data Modalities | CLIP/ColPali for visual retrieval | Index PDFs/images without OCR
🚀 LLMOps | Production | CI/CD · evals · shadow deploy · flywheel | Reliable, improving production systems

Memory, Reasoning & Production

How These Concepts Connect