COMPLETE AI PIPELINE — VISUAL GUIDE

From Tokens to Agents at Scale

16 interconnected concepts — LLMs, RAG, Agents, MCP, A2A, Evaluation, Guardrails, and Observability — explained with live data flow animations.



01 / FOUNDATION

Large Language Model

A neural network trained to predict the next token over massive text corpora. Billions of parameters encode language, facts, and reasoning. Everything else builds on top of it.

INPUT TOKENS
<sys> You are a helpful assistant </sys> <user> Explain quantum computing </user>
TRANSFORMER ARCHITECTURE: 70B params · 96 layers · 8192 ctx
Token Embedding: d=4096
Positional Encoding: RoPE
Multi-Head Attention: h=64
Feed-Forward Network: SwiGLU
RMS Layer Norm: pre-norm
LM Head + Softmax: vocab=32k
OUTPUT → Quantum computing uses qubits that can…
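
As a rough illustration of the final "LM Head + Softmax" step, the sketch below turns a vector of logits into a probability distribution and greedily picks the most likely next token. The four-word vocabulary and the logit values are made up for the example; a real model would emit one logit per vocabulary entry and often samples rather than always taking the maximum.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(vocab, logits):
    """Return the highest-probability token and the full distribution."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

# Hypothetical toy vocabulary and logits (not taken from any real model).
vocab = ["qubits", "bits", "cats", "the"]
logits = [3.1, 1.2, -0.5, 0.3]

token, probs = greedy_next_token(vocab, logits)
print(token)
```

Generation repeats this step: the chosen token is appended to the input and the model is run again.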
02 / SPECIALIZATION

Fine-Tuning

Adapt a pre-trained foundation model to your specific domain by continuing training on curated task data. Done well, this improves accuracy on domain-specific tasks and can reduce hallucination within that domain.

🧠 Foundation: General LLM with broad, general-purpose knowledge
📦 Dataset: Curated domain Q&A pairs + instructions
🔥 Training: Gradient descent minimizes cross-entropy
Specialist: Expert model, precise & aligned

Full Fine-Tuning

Updates all model weights. Maximum performance but requires significant GPU memory.

100% params

LoRA

Freezes the base weights and injects low-rank adapter matrices into attention layers. Typically 100× or more fewer trainable params, depending on the rank.

r=8..64 rank
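
To make the LoRA saving concrete, here is a small back-of-the-envelope calculation for a single weight matrix. It assumes a square 4096 × 4096 projection (matching the d=4096 width shown in section 01) and the usual LoRA factorization into two matrices of shapes (d_out × r) and (r × d_in); the function names are illustrative only.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable parameters for the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

d = 4096  # assumed hidden size, taken from the architecture card
for rank in (8, 64):
    full = full_params(d, d)
    lora = lora_params(d, d, rank)
    print(f"r={rank}: {lora:,} trainable vs {full:,} full ({full // lora}x fewer)")
```

Per adapted matrix, this gives a 256× reduction at r=8 and 32× at r=64. The saving relative to the whole model is larger still, because the matrices that receive no adapter stay frozen.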

QLoRA

LoRA on a 4-bit quantized base model. Makes it possible to fine-tune a 65B model on a single 48 GB GPU.

4-bit quant

RLHF

Reinforcement Learning from Human Feedback. Trains the model toward responses that human raters prefer, such as following instructions helpfully and safely.

PPO / DPO
[Figure: training loss curve. Train loss and validation loss plotted over epochs 0 to 10, converging.]
03 / INGESTION

Types of Chunking

Before documents can be embedded and stored, they must be split into retrievable pieces. Chunking strategy dramatically affects retrieval quality — choose wisely.

RAW DOCUMENT → CHUNKS
✂️ Fixed-Size
chunk-1 | chunk-2 | chunk-3
Split every N tokens (e.g. 512) with optional overlap. Simple but can break sentences mid-thought.
✓ Predictable size, fast
✗ Ignores semantic boundaries

🔀 Recursive Splitter
paragraph → sentence → word
Split on the separators ["\n\n", "\n", ".", " "] in order, recursing until each chunk fits the size limit. Respects structure.
✓ Preserves structure
✗ Inconsistent chunk sizes

🧠 Semantic
topic A | topic B
Embed sentences, detect cosine similarity drops → split there. Groups conceptually similar content.
✓ Often the best retrieval quality
✗ Slower, needs embed model

📑 Document-Aware
H1: Intro | H2: Data | H3: API
Use Markdown headers, HTML tags, or PDF structure to split at logical document sections.
✓ Structure-aware
✗ Format-specific parsers needed

🪟 Sliding Window
[A·B·C] [B·C·D] [C·D·E]
Fixed-size chunks with overlap (e.g. 50 tokens) so context spans chunk boundaries.
✓ Less context loss at boundaries
✗ Duplicate embeddings stored

🌲 Hierarchical (RAPTOR)
summary cluster → leaf · leaf · leaf
Cluster and summarize chunks recursively into a tree. Enables multi-resolution retrieval.
✓ Handles long-range reasoning
✗ Complex, high storage cost
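
The fixed-size and sliding-window strategies above differ only in the overlap setting, so one small function can sketch both. This is a minimal illustration that assumes the text is already tokenized into a list; the `chunk_tokens` name and its parameters are invented for the example, and a real pipeline would use the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, size, overlap=0):
    """Split a token list into windows of `size`, sharing `overlap` tokens between neighbours."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks

tokens = list("ABCDE")  # single letters stand in for tokens

sliding = chunk_tokens(tokens, size=3, overlap=2)  # sliding window
fixed = chunk_tokens(tokens, size=2)               # fixed-size, no overlap
print(sliding)
print(fixed)
```

With size 3 and overlap 2 this reproduces the [A·B·C] [B·C·D] [C·D·E] windows from the sliding-window card; with no overlap the final chunk is simply shorter.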
04 / REPRESENTATION

Embeddings

Convert text chunks into dense numerical vectors where semantic similarity = geometric closeness. The bridge between human language and machine-searchable space.

INPUT: "Quantum computing uses qubits…"
EMBEDDING MODEL: text-embedding-3-large (3072 dimensions)
VECTOR (3072d): [0.23, -0.87, 0.41, …]
[Figure: semantic vector space. Similar concepts cluster together: ⚛ Physics, 💻 Computing, 🧬 Biology.]
05 / STORAGE

Vector Database

Purpose-built databases that store and search dense embedding vectors using ANN (Approximate Nearest Neighbor) algorithms. Orders of magnitude faster than brute-force cosine search.

Weaviate (Open Source): Speed ~5ms · Scale: Millions · Hybrid: ✓ native
Qdrant (Rust · Fast): Speed ~2ms · Filtering: Excellent · Memory: Efficient
ChromaDB (Local / Proto): Speed ~10ms · Setup: pip install · Scale: Thousands

HNSW

Hierarchical Navigable Small World graph. Strong accuracy/speed tradeoff; the default index in many vector databases.

IVF

Inverted File Index. Clusters vectors into Voronoi cells. Fast at massive scale (100M+).

PQ

Product Quantization. Compresses vectors 8-32× for memory efficiency at cost of accuracy.

FLAT

Brute-force exact search. 100% recall but O(n) — only for small datasets (<100k).
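
For reference, the FLAT baseline that the ANN indexes approximate can be sketched in a few lines: score the query against every stored vector and keep the top k. The three-dimensional vectors and document ids below are invented for the example; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def flat_search(query, vectors, k=2):
    """Exact nearest-neighbour search: one similarity computation per stored vector, O(n)."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical toy index.
vectors = {
    "physics": [0.9, 0.1, 0.0],
    "computing": [0.7, 0.7, 0.1],
    "biology": [0.0, 0.2, 0.9],
}

results = flat_search([1.0, 0.2, 0.0], vectors, k=2)
print([doc_id for doc_id, _ in results])
```

HNSW, IVF, and PQ all trade some of this exactness for fewer comparisons or less memory.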

06 / QUERY INTELLIGENCE

Query Decomposition

Complex questions rarely map to a single vector search. Decompose them into focused sub-queries, rewrite ambiguities, and expand with hypothetical answers to maximize recall.

💬 "Compare the economic and environmental impact of electric vs hydrogen vehicles in 2024"
↓ DECOMPOSED INTO →
1. Economic cost of electric vehicles 2024
2. Economic cost of hydrogen fuel cell cars 2024
3. Carbon footprint electric vehicle lifecycle
4. Environmental impact green hydrogen production

Sub-Query Split

Break multi-part questions into atomic retrieval tasks, each targeting one concept.

Query Rewriting

An LLM reformulates an ambiguous query into a clearer, more searchable form before embedding.

HyDE

Hypothetical Document Embeddings: generate a hypothetical answer, embed it, and search with that vector instead of the raw query.

Step-Back

Abstract the specific question to a more general version that captures broader context.

Multi-Query

Generate N rephrasings of the query, retrieve for each, union the results.

RAG-Fusion

Multi-query + Reciprocal Rank Fusion to merge and re-rank results from all sub-queries.

07 / RETRIEVAL

Types of Retrieval

Different retrieval strategies suit different queries. Dense retrieval excels at semantic similarity; sparse at exact keyword matching. Hybrid combines both for maximum coverage.

🔵 Dense Retrieval

Embed query + documents into the same vector space. Find nearest neighbors by cosine / dot-product similarity. Best for semantic questions.

similarity = q · d / (|q||d|)

🟠 Sparse (BM25)

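TF-IDF variant that scores keyword overlap, with term-frequency saturation and document-length normalization. Excels at exact term matching such as product names, IDs, and proper nouns. Does not capture synonyms or paraphrases.

BM25(q,d) = Σ IDF(qi) · f(qi,d)·(k1 + 1) / (f(qi,d) + k1·(1 − b + b·|d|/avgdl))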
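
A minimal sketch of BM25 scoring over a tiny in-memory corpus, including the usual term-frequency saturation (k1) and length normalization (b) parameters. Whitespace tokenization, the default values k1=1.5 and b=0.75, and the smoothed IDF form are assumptions chosen for the example; production engines differ in tokenization and in details of the IDF.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n_docs
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            f = term_freq.get(term, 0)
            if f == 0:
                continue
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)  # smoothed, never negative
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "qubits exploit entanglement",
    "shor's algorithm factors large integers",
    "classical bits store zero or one",
]
scores = bm25_scores("shor's algorithm", docs)
print(scores)
```

Only the second document contains the query terms, so it is the only one with a non-zero score, which is the exact-match behaviour described above.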

🟢 Hybrid Search

Combine dense + sparse scores via Reciprocal Rank Fusion or weighted sum. Best of both worlds — semantics + exact match.

RRF(d) = Σ 1/(k + rank_i)
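
The RRF formula can be sketched directly: each ranked list contributes 1/(k + rank) for every document it contains, and the contributions are summed. The document ids are invented for the example, and k=60 is a commonly used constant rather than something this guide prescribes.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list ordered by summed 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense-retrieval order
sparse = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 order

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main appeal over a weighted sum.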

🟣 Graph RAG

Build a knowledge graph from entities and relations. Traverse graph edges to answer multi-hop questions that span many documents.

entity → relation → entity
RETRIEVAL RESULTS — RANKED BY SCORE
0.94 · …quantum bits maintain superposition until measured, enabling parallel computation…
0.87 · …qubits exploit entanglement to process exponentially more states than classical bits…
0.79 · …Shor's algorithm leverages quantum interference to factor large integers efficiently…
0.61 · …IBM's 433-qubit Osprey processor, announced in 2022, more than tripled the qubit count of its predecessor…
08 / REFINEMENT

Reranking

A cross-encoder model re-scores retrieved chunks against the query jointly, capturing fine-grained relevance that the bi-encoder embedding model missed. Precision over recall.

BEFORE — ANN RETRIEVAL ORDER
#1 Quantum history
#2 Qubit superposition
#3 IBM quantum chip
#4 Shor's algorithm
#5 Quantum gates
🏆 Cross-Encoder ↓↑ Reranks top-K with full attention
AFTER — RERANKED ORDER
#1 Qubit superposition ✓
#2 Shor's algorithm ✓
#3 Quantum gates
#4 IBM quantum chip
#5 Quantum history
09 / FULL PIPELINE

RAG — End-to-End

Retrieval-Augmented Generation unifies all previous steps. The user query flows through decomposition → retrieval → reranking → context injection → LLM generation.

── OFFLINE (INDEXING PHASE) ──────────────────────────
📄 Docs → ✂️ Chunk → 🔢 Embed → 🗄️ Vector DB
── ONLINE (QUERY PHASE) ───────────────────────────────
💬 Query → 🔀 Decompose → 🔢 Embed → 🔍 Retrieve → 🏆 Rerank
── GENERATION PHASE ───────────────────────────────────
[SYSTEM] You are a helpful assistant. Use only the provided context.
[CONTEXT] Chunk #1: Qubits exploit superposition to represent 0 and 1 simultaneously…
            Chunk #2: Shor's algorithm factors integers in polynomial time using quantum gates…
[USER] Explain quantum computing simply.
🧠 LLM → ✅ Grounded Answer
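
The context-injection step amounts to string assembly. The sketch below builds a prompt in the [SYSTEM] / [CONTEXT] / [USER] layout shown above from a list of reranked chunks; the `build_prompt` helper is hypothetical, and real chat APIs usually take these parts as separate messages rather than one string.

```python
def build_prompt(system, chunks, question):
    """Assemble a grounded prompt from a system instruction, retrieved chunks, and the user question."""
    context_lines = [f"Chunk #{i}: {text}" for i, text in enumerate(chunks, start=1)]
    return "\n".join([
        f"[SYSTEM] {system}",
        "[CONTEXT] " + "\n".join(context_lines),
        f"[USER] {question}",
    ])

chunks = [
    "Qubits exploit superposition to represent 0 and 1 simultaneously…",
    "Shor's algorithm factors integers in polynomial time using quantum gates…",
]
prompt = build_prompt(
    "You are a helpful assistant. Use only the provided context.",
    chunks,
    "Explain quantum computing simply.",
)
print(prompt)
```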
10 / AUTONOMY

Single Agent

An LLM given tools and a Perceive → Plan → Act → Reflect loop. It autonomously decides which tool to call, executes it, observes the result, and iterates toward a goal.

🤖 AGENT, connected to: 🔍 Search · 💾 Memory · Execute · 🗺️ Plan
AGENT EXECUTION TRACE
💭 THINK: I need to find the latest GDP data, then analyze trends.
🔧 TOOL CALL: web_search("global GDP growth 2024 IMF")
📥 OBSERVE: Results: IMF projects 3.2% global growth in 2024…
💭 THINK: Now I'll run analysis code on this data.
🔧 TOOL CALL: code_exec("import pandas as pd; df.describe()")
DONE: Goal achieved. Returning analysis to user.
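
The control flow behind a trace like this can be sketched as a short loop. Everything here is a stand-in: `scripted_policy` plays the role of the LLM with hard-coded decisions, and `web_search` is a stub that returns canned text rather than calling a real search API.

```python
def web_search(query):
    """Stub tool: returns canned text instead of calling a real search API."""
    return "IMF projects 3.2% global growth in 2024"

TOOLS = {"web_search": web_search}

def scripted_policy(history):
    """Stand-in for the LLM: choose the next action from the history so far."""
    if not any(step[0] == "observe" for step in history):
        return ("tool", "web_search", "global GDP growth 2024 IMF")
    return ("done", "Goal achieved. Returning analysis to user.")

def run_agent(policy, tools, max_steps=5):
    """Plan → act → observe loop that stops on 'done' or after max_steps."""
    history = []
    for _ in range(max_steps):
        action = policy(history)
        if action[0] == "done":
            history.append(("done", action[1]))
            break
        _, name, arg = action
        history.append(("tool_call", name, arg))
        history.append(("observe", tools[name](arg)))
    return history

trace = run_agent(scripted_policy, TOOLS)
for step in trace:
    print(step)
```

The `max_steps` cap is a design choice worth keeping in a real agent, since a model that never emits a final answer would otherwise loop indefinitely.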
11 / COLLABORATION

Multi-Agent Systems

Specialized agents collaborate under an orchestrator, parallelizing complex tasks. Each agent owns a domain and communicates structured results through a shared protocol.

🎯 Orchestrator Agent — Plans & Delegates
🔍 Researcher: Web search & summarize
💻 Coder: Write & execute code
🔎 Critic: Review & validate
✍️ Writer: Draft final output
🛡️ Safety: Check guardrails
📦 Final Verified Deliverable
orchestrator → researcher: "Find top 3 papers on transformer efficiency"
researcher → orchestrator: [{title, abstract, url} × 3]
orchestrator → coder: "Implement FlashAttention from paper #1"
coder → critic: def flash_attention(q,k,v): ...
critic → orchestrator: "Approved ✓: edge case on empty context fixed"
12 / PROTOCOL

Model Context Protocol (MCP)

Anthropic's open standard giving AI models a universal, typed interface to tools and data sources. One protocol replaces hundreds of bespoke integrations.

HOST APP: Claude Desktop · VS Code · Custom Agent · Web App
⚡ MCP: JSON-RPC 2.0 · stdio / SSE / Streamable HTTP
📁 Filesystem: Read/write local files
🌐 Browser: Fetch & scrape web
🗃️ Database: SQL / NoSQL queries
🔧 GitHub: Repos, PRs, issues
📧 Email: Read/send messages
📅 Calendar: Events & scheduling


REQUEST LIFECYCLE — JSON-RPC OVER STDIO
🧠 LLM: selects tool
📤 Request: {"method":"tools/call"}
🔧 MCP Server: validates + runs
📥 Response: {"result":…}
💬 LLM: continues reasoning
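
The lifecycle above can be sketched as a pair of JSON-RPC 2.0 messages. The envelope fields (jsonrpc, id, method, params, result, error) and the error codes come from JSON-RPC 2.0. The params shape (name, arguments) and the content list in the result are written to resemble an MCP tools/call exchange and should be checked against the MCP specification. The `read_note` tool and the in-process `handle_request` dispatcher are hypothetical; a real server would sit behind a transport such as stdio.

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking the server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

def error_response(request_id, code, message):
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}

def handle_request(request, tools):
    """Toy server side: validate the request, run the tool, wrap the result."""
    request_id = request.get("id")
    if request.get("method") != "tools/call":
        return error_response(request_id, -32601, "Method not found")
    params = request.get("params", {})
    tool = tools.get(params.get("name"))
    if tool is None:
        return error_response(request_id, -32602, "Unknown tool")
    text = tool(**params.get("arguments", {}))
    return {"jsonrpc": "2.0", "id": request_id,
            "result": {"content": [{"type": "text", "text": text}]}}

def read_note(path):
    """Hypothetical tool used only for this example."""
    return f"contents of {path}"

request = make_tool_call(1, "read_note", {"path": "notes.txt"})
response = handle_request(request, {"read_note": read_note})
print(json.dumps(response))
```

The matching `id` on request and response is what lets the host pair results with outstanding calls when several are in flight.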
13 / INTEROPERABILITY

Agent-to-Agent Protocol (A2A)

Google's open protocol enabling agents on different frameworks to discover each other, delegate tasks, and exchange typed messages — the HTTP of the multi-agent web.

🤖 Agent Alpha: Research & Planning (Claude / LangChain)
A2A CHANNEL (JSON / HTTP / SSE): → task_request · ← task_result · → stream_chunk · ← artifact
🦾 Agent Beta: Code & Execution (GPT-4 / AutoGen)
🔍 Agent Discovery

Agents publish capabilities at /.well-known/agent.json, where peers can discover them automatically.

📋 Task Delegation

Structured Task objects with typed inputs, outputs, status, and cancellation support.

📡 Streaming

Server-Sent Events for long-running tasks, so progress updates flow continuously.

🔐 Auth & Security

OAuth 2.0 and API key support for enterprise-grade access control between agents.

🌐 Framework Agnostic

Claude, GPT, Gemini, LangChain, AutoGen: all interoperate via the same protocol.

📦 Rich Artifacts

Text, files, structured JSON, and images are all supported as typed task output artifacts.

14 / MEASUREMENT

Evaluation

Systematic measurement of LLM and RAG system quality. Without rigorous evals, you're flying blind — you can't improve what you can't measure.

Faithfulness: 0.91
The answer contains only information supported by the retrieved context. Low scores indicate hallucination.
Answer Relevancy: 0.87
Does the answer actually address the user's question?
Context Precision: 0.78
Fraction of retrieved chunks that were actually relevant (signal-to-noise).
Context Recall: 0.83
Did retrieval find all necessary information to answer the question?
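
As a simplified illustration of the two context metrics, the sketch below computes set-based precision and recall from hand-labelled relevant chunks. This is not how RAGAS itself scores them (it relies on LLM judgments and takes ranking into account); the chunk ids and labels are invented for the example.

```python
def context_precision(retrieved, relevant):
    """Share of retrieved chunks that are labelled relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)

def context_recall(retrieved, relevant):
    """Share of relevant chunks that retrieval actually returned."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in relevant if chunk_id in retrieved)
    return hits / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]  # hypothetical retrieval output
relevant = {"c1", "c3", "c7"}         # hypothetical ground-truth labels

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(precision, recall)
```

Here two of four retrieved chunks are relevant (precision 0.5) and two of three relevant chunks were found (recall about 0.67).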
Automated (RAGAS)
  • Faithfulness score
  • Answer relevancy
  • Context precision/recall
  • BLEU / ROUGE
LLM-as-Judge
  • GPT-4 grades outputs
  • Rubric-based scoring
  • Pairwise comparison
  • Constitutional AI
Human Eval
  • Expert annotation
  • A/B user testing
  • Likert scale rating
  • Ground truth labels
Behavioral
  • Adversarial prompts
  • Robustness testing
  • Latency benchmarks
  • Cost per query
15 / SAFETY

Guardrails

Input and output filters that detect and block harmful, off-topic, or policy-violating content. Guardrails sit as a protective layer around every LLM call.

DATA FLOW WITH GUARDRAILS
💬 User Input → 🛡️ Input Guard → 🧠 LLM → 🛡️ Output Guard → ✅ Safe Response
Example check results: ✅ On-topic · ✅ No PII · 🚫 Prompt injection · ✅ Compliant · 🚫 PII detected · ✅ Non-toxic
🎯 Topic Filter

Detects off-topic requests outside the system's intended domain.

competitor mention · off-topic

🔒 PII Detection

Identifies and redacts SSN, credit cards, emails, phone numbers from inputs and outputs.

SSN · CC number · email

💉 Prompt Injection

Detects attempts to override system prompt or hijack agent behavior.

ignore instructions · DAN

☠️ Toxicity / Harm

NLI classifiers detect hate speech, violence, self-harm, CBRN content.

hate speech · violence

Factual Grounding

Validates output against retrieved context and flags unsupported claims (hallucination detection).

hallucination · citation

⚖️ Compliance

Industry-specific rules: HIPAA, GDPR, SOX, financial advice disclaimers.

HIPAA · GDPR · SOX
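
As one small example of a guardrail, the PII Detection check can be approximated with regular expressions. The two patterns below (email addresses and US-style SSNs) are deliberately simple assumptions for illustration and will miss many real-world formats; production systems typically combine patterns with trained entity recognizers.

```python
import re

# Illustrative patterns only; not exhaustive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text):
    """Replace each match with a [LABEL] placeholder and report which labels fired."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found

clean, found = redact_pii("Contact jane@example.com, SSN 123-45-6789.")
print(clean)
print(found)
```

The same function can run in both the input guard and the output guard, since PII can enter from the user or leak from retrieved context.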
16 / MONITORING

Observability

Full-stack visibility into your AI system — traces, metrics, logs, and alerts. You can't optimize what you can't see. Observability closes the loop between deployment and improvement.

LIVE DASHBOARD
Uptime: 98.7% · P50 Latency: 1.2s · P99 Latency: 4.1s · Avg RAGAS: 0.89 · Error Rate: 0.3% · Cost/Query: $0.04
[Figure: query volume over the last 24h]
DISTRIBUTED TRACE — SINGLE REQUEST
total request: 3840ms
query embed: 120ms
ANN search: 80ms
rerank: 210ms
LLM generate: 2850ms
guardrails: 180ms
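
A trace like this is most useful once the spans are compared against the total. The short sketch below uses the span durations listed above to find the dominant span and the time not covered by any listed span; the `summarize_trace` helper is illustrative and not part of any tracing library.

```python
# Span durations in milliseconds, copied from the trace above.
spans_ms = {
    "query embed": 120,
    "ANN search": 80,
    "rerank": 210,
    "LLM generate": 2850,
    "guardrails": 180,
}
total_ms = 3840

def summarize_trace(spans, total):
    """Report accounted time, unattributed time, and the slowest span's share of the total."""
    accounted = sum(spans.values())
    slowest = max(spans, key=spans.get)
    return {
        "accounted_ms": accounted,
        "unattributed_ms": total - accounted,
        "slowest_span": slowest,
        "slowest_share": round(spans[slowest] / total, 3),
    }

summary = summarize_trace(spans_ms, total_ms)
print(summary)
```

The listed spans add up to 3440ms, leaving 400ms of the 3840ms total unattributed (for example queueing or network time between steps), and LLM generation alone accounts for roughly 74% of the request.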
📡 Traces

End-to-end request tracing across retrieval, LLM calls, tool use, and agent steps. OpenTelemetry compatible.

📊 Metrics

Latency P50/P95/P99, token usage, cost per query, retrieval hit rate, hallucination rate.

📝 Logs

Structured logs of prompts, completions, retrieved chunks, tool calls, and guardrail decisions.

🚨 Alerts

Real-time alerts on latency spikes, error rate increases, quality degradation, and cost anomalies.
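
For the latency metrics, percentiles such as P50 and P99 can be computed from raw request timings with the nearest-rank method, sketched below. The sample latencies are invented for the example, and monitoring backends often use interpolation or streaming estimators instead.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in seconds.
latencies_s = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 3.2, 4.1]

p50 = percentile(latencies_s, 50)
p99 = percentile(latencies_s, 99)
print(p50, p99)
```

Tail percentiles are worth alerting on separately from the median, because a handful of slow requests can dominate P99 while leaving P50 unchanged.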

Full Pipeline Summary

Concept | Role in Pipeline | Key Mechanism | Key Tool / Metric
🧠 LLM | Foundation | Transformer · next-token prediction | GPT-4o, Claude, Llama
⚙️ Fine-Tuning | Specialization | LoRA / QLoRA / RLHF on domain data | Loss curve, eval accuracy
✂️ Chunking | Ingestion | Fixed / Semantic / Hierarchical split | Chunk size, overlap
🔢 Embeddings | Representation | Dense vectors · cosine similarity | MTEB benchmark
🗄️ Vector DB | Storage & Search | HNSW / IVF ANN index | Pinecone, Qdrant, Weaviate
🔀 Query Decomp | Query Intelligence | Sub-queries / HyDE / RAG-Fusion | Recall improvement %
🔍 Retrieval | Finding Context | Dense / Sparse / Hybrid / Graph | MRR, NDCG, Recall@K
🏆 Reranking | Refinement | Cross-encoder re-scores top-K | Cohere Rerank, Jina
📚 RAG | Full Pipeline | Retrieve → Augment → Generate | RAGAS score
🤖 Single Agent | Autonomy | Perceive → Plan → Act loop | Task completion rate
🌐 Multi-Agent | Collaboration | Orchestrator + specialist agents | Parallelism, throughput
⚡ MCP | Tool Protocol | JSON-RPC 2.0 typed tool calls | Tool call latency
🔗 A2A | Agent Interop | HTTP task delegation protocol | Cross-framework compat
📊 Evaluation | Measurement | RAGAS / LLM-judge / Human eval | Faithfulness, relevancy
🛡️ Guardrails | Safety | Input/output NLI classifiers | Block rate, false positive
👁️ Observability | Monitoring | Traces · metrics · logs · alerts | Latency, cost, error rate