AI Pipeline — Complete Visual Guide

01 / FOUNDATION

Large Language Model

A neural network trained to predict the next token over massive text corpora. Billions of parameters encode language, facts, and reasoning. Everything else builds on top of it.

INPUT TOKENS

<sys> You are a helpful assistant </sys> <user> Explain quantum computing </user>

↓

TRANSFORMER ARCHITECTURE 70B params · 96 layers · 8192 ctx

Token Embedding: d=4096

Positional Encoding: RoPE

Multi-Head Attention: h=64

Feed-Forward Network: SwiGLU

RMS Layer Norm: pre-norm

LM Head + Softmax: vocab=32k

↓

OUTPUT →

Quantum computing uses qubits that can…
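To make the final LM Head + Softmax step concrete, here is a minimal sketch. The five-word vocabulary and the logit values are invented for illustration; a real model emits one logit per vocabulary entry. Greedy decoding picks the most probable token, and the cross-entropy of the correct token is the quantity that next-token training minimizes.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(probs[target_index])

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["qubits", "bits", "cats", "that", "uses"]
logits = [3.2, 1.1, -2.0, 0.3, 0.5]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]          # greedy decoding
loss = cross_entropy(probs, vocab.index("qubits"))   # loss if "qubits" is correct
print(next_token, round(loss, 3))
```

Production decoders usually sample with temperature or top-p instead of always taking the argmax.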

02 / SPECIALIZATION

Fine-Tuning

Adapt a pre-trained foundation model to your specific domain by continuing training on curated task data. This improves accuracy and consistency on domain-specific use cases and can reduce, though not eliminate, hallucination.

🧠

Foundation

General LLM with broad, general-purpose knowledge

→

📦

Dataset

Curated domain Q&A pairs + instructions

→

🔥

Training

Gradient descent minimizes cross-entropy

→

✅

Specialist

Expert model — precise & aligned

Full Fine-Tuning

Updates all model weights. Maximum performance but requires significant GPU memory.

100% params

LoRA

Injects trainable low-rank adapter matrices into attention layers while the base weights stay frozen. Often 100× or more fewer trainable params.

r=8..64 rank

QLoRA

LoRA on a 4-bit quantized base model. Fine-tune a 65B model on a single 48 GB GPU.

4-bit quant

RLHF

Reinforcement Learning from Human Feedback. Trains the model on human preference data so it follows instructions more helpfully and safely.

PPO / DPO
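A rough sketch of the LoRA idea, assuming the usual formulation in which a frozen weight matrix gets a trainable low-rank update scaled by alpha / r. The names `down`, `up`, and `lora_forward` are illustrative and not taken from any library, and the matrices are tiny so the example runs on plain Python lists.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, down, up, alpha):
    """Compute x @ w + (alpha / r) * (x @ down @ up); only down and up are trained."""
    r = len(up)                              # rank = rows of the up-projection
    base = matmul(x, w)                      # frozen pre-trained path
    delta = matmul(matmul(x, down), up)      # trainable low-rank path
    scale = alpha / r
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

def trainable_params(d_in, d_out, r):
    """Adapter size: a (d_in x r) matrix plus an (r x d_out) matrix."""
    return r * (d_in + d_out)

random.seed(0)
d_in, d_out, r = 6, 4, 2
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
down = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]
up = [[0.0] * d_out for _ in range(r)]       # zero init: the adapter starts as a no-op
x = [[1.0] * d_in]

out = lora_forward(x, w, down, up, alpha=16)
full = 4096 * 4096
lora = trainable_params(4096, 4096, r=8)
print(full // lora)  # 256
```

For a single 4096 × 4096 projection at r=8 the adapter holds 256× fewer trainable parameters than the full matrix; the overall saving depends on which layers receive adapters.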

TRAINING LOSS CURVE — CONVERGING

03 / INGESTION

Types of Chunking

Before documents can be embedded and stored, they must be split into retrievable pieces. Chunking strategy dramatically affects retrieval quality — choose wisely.

RAW DOCUMENT → CHUNKS

✂️ Fixed-Size

chunk-1

chunk-2

chunk-3

Split every N tokens (e.g. 512) with optional overlap. Simple but can break sentences mid-thought.

✓ Predictable size, fast

✗ Ignores semantic boundaries
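A minimal sketch of fixed-size chunking with optional overlap. It assumes the text is already tokenized into a list (whitespace tokens here, purely for illustration); `chunk_fixed` is a hypothetical helper name.

```python
def chunk_fixed(tokens, size, overlap=0):
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

tokens = "A B C D E".split()
chunks = chunk_fixed(tokens, size=3, overlap=2)
print(chunks)  # [['A', 'B', 'C'], ['B', 'C', 'D'], ['C', 'D', 'E']]
```

With `overlap=0` the same function yields plain non-overlapping chunks; a larger overlap trades extra storage for context that survives chunk boundaries.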

🔀 Recursive Splitter

paragraph

sentence

word

Split on ["\n\n", "\n", ".", " "] in order, falling back to the next separator until each chunk fits the size limit. Respects structure.

✓ Preserves structure

✗ Inconsistent chunk sizes
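The recursive strategy can be sketched as follows. This is a simplified illustration, not the implementation of any particular library: it measures size in characters, uses ". " as the sentence separator, and drops the separator at each split point.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing to finer ones for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_len, finer)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate          # greedily merge neighbours that still fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

text = "Para one. It has two sentences.\n\nPara two is short."
chunks = recursive_split(text, max_len=30)
print(chunks)
```

The first paragraph is too long for the limit, so it is split again at the sentence level, while the second paragraph survives intact.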

🧠 Semantic

topic A

topic B

Embed sentences, detect cosine similarity drops → split there. Groups conceptually similar content.

✓ Often the best retrieval quality

✗ Slower, needs embed model
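A sketch of similarity-drop splitting. The `embed` function below is a deliberately crude stand-in (bag-of-words counts) so the example runs without a model; a real pipeline would call an embedding model there. The 0.2 threshold is an arbitrary assumption.

```python
import math
from collections import Counter

def embed(sentence):
    """Stand-in embedding: word counts. Replace with a real embedding model."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])       # similarity dropped: topic boundary
        else:
            chunks[-1].append(cur)
    return chunks

sentences = [
    "Qubits store quantum states.",
    "Quantum states of qubits can be entangled.",
    "Cells divide during mitosis.",
    "Mitosis splits cells in two.",
]
chunks = semantic_chunks(sentences)
print(len(chunks))  # 2
```

The physics sentences and the biology sentences end up in separate chunks because the similarity between sentence two and sentence three is zero.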

📑 Document-Aware

H1: Intro

H2: Data

H3: API

Use Markdown headers, HTML tags, or PDF structure to split at logical document sections.

✓ Structure-aware

✗ Format-specific parsers needed

🪟 Sliding Window

[A·B·C]

[B·C·D]

[C·D·E]

Fixed-size chunks with overlap (e.g. 50 tokens) so context spans chunk boundaries.

✓ Less context loss at boundaries

✗ Duplicate embeddings stored

🌲 Hierarchical (RAPTOR)

summary cluster

leaf

Cluster and summarize chunks recursively into a tree. Enables multi-resolution retrieval.

✓ Handles long-range reasoning

✗ Complex, high storage cost

04 / REPRESENTATION

Embeddings

Convert text chunks into dense numerical vectors where semantic similarity = geometric closeness. The bridge between human language and machine-searchable space.

"Quantum computing uses qubits…"

→

EMBEDDING MODEL

text-embedding-3-large
3072 dimensions

→

VECTOR (3072d)

[0.23, -0.87, 0.41, …]

SEMANTIC VECTOR SPACE — SIMILAR CONCEPTS CLUSTER TOGETHER

⚛ Physics

💻 Computing

🧬 Biology

05 / STORAGE

Vector Database

Purpose-built databases that store and search dense embedding vectors using ANN (Approximate Nearest Neighbor) algorithms. At scale, ANN search is orders of magnitude faster than brute-force cosine search, in exchange for slightly lower recall.

🔷

Pinecone

Managed Cloud

Speed: ~1ms

Scale: Billions

Ease: ★★★★★

🟣

Weaviate

Open Source

Speed: ~5ms

Scale: Millions

Hybrid: ✓ native

🟡

Qdrant

Rust · Fast

Speed: ~2ms

Filtering: Excellent

Memory: Efficient

🟢

ChromaDB

Local / Proto

Speed: ~10ms

Setup: pip install

Scale: Thousands

HNSW

Hierarchical Navigable Small World graph. Strong accuracy/speed tradeoff. The default index in many vector databases.

IVF

Inverted File Index. Clusters vectors into Voronoi cells. Fast at massive scale (100M+).

PQ

Product Quantization. Compresses vectors 8-32× for memory efficiency at cost of accuracy.

FLAT

Brute-force exact search. 100% recall but O(n) per query, so practical only for small datasets (<100k vectors).
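To contrast FLAT with IVF, here is a toy sketch using 2-D vectors and squared Euclidean distance. The centroids are assumed to be given (they would normally be learned with k-means), and all function names are illustrative rather than any database's API.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        cells[nearest].append(vid)
    return cells

def ivf_search(query, vectors, centroids, cells, k=1, nprobe=1):
    """Approximate search: scan only the nprobe cells closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in cells[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]

def flat_search(query, vectors, k=1):
    """Exact baseline: scan every vector."""
    return sorted(range(len(vectors)), key=lambda vid: sq_dist(query, vectors[vid]))[:k]

vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.1)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # assumed given for this sketch
cells = build_ivf(vectors, centroids)
query = (4.9, 5.0)

approx = ivf_search(query, vectors, centroids, cells, k=2, nprobe=1)
exact = flat_search(query, vectors, k=2)
print(approx, exact)  # [2, 3] [2, 3]
```

Here IVF inspects two of five vectors and still matches the exact result. When a true neighbor sits in an unprobed cell, recall drops, which is why `nprobe` is the main accuracy/speed knob.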

06 / QUERY INTELLIGENCE

Query Decomposition

Complex questions rarely map to a single vector search. Decompose them into focused sub-queries, rewrite ambiguities, and expand with hypothetical answers to maximize recall.

💬 "Compare the economic and environmental impact of electric vs hydrogen vehicles in 2024"

↓ DECOMPOSED INTO →

Economic cost of electric vehicles 2024

Economic cost of hydrogen fuel cell cars 2024

Carbon footprint electric vehicle lifecycle

Environmental impact green hydrogen production

Sub-Query Split

Break multi-part questions into atomic retrieval tasks, each targeting one concept.

Query Rewriting

LLM reformulates ambiguous query into clearer, more searchable form before embedding.

HyDE

Hypothetical Document Embeddings — generate a fake answer, embed it, search on that vector.

Step-Back

Abstract the specific question to a more general version that captures broader context.

Multi-Query

Generate N rephrasings of the query, retrieve for each, union the results.

RAG-Fusion

Multi-query + Reciprocal Rank Fusion to merge and re-rank results from all sub-queries.
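The multi-query pattern can be sketched with the LLM and the retriever passed in as plain functions. `fake_rephrase`, `fake_retrieve`, and `CORPUS` are hypothetical stand-ins so the example runs offline; a real system would call an LLM and a vector store instead.

```python
def multi_query_retrieve(query, rephrase, retrieve, n=3, k=2):
    """Retrieve for the original query plus n rephrasings and union the results."""
    seen, merged = set(), []
    for q in [query] + rephrase(query, n):
        for doc_id in retrieve(q, k):
            if doc_id not in seen:       # keep the first occurrence, drop duplicates
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Hypothetical corpus and stand-in components, for illustration only.
CORPUS = {
    "d1": "electric vehicles cost less to run",
    "d2": "environmental impact of hydrogen production",
    "d3": "hydrogen vehicles cost more in 2024",
}

def fake_rephrase(query, n):
    templates = ["cost of {}", "environmental impact of {}", "{} in 2024"]
    return [t.format(query) for t in templates[:n]]

def fake_retrieve(query, k):
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda d: -len(words & set(CORPUS[d].split())))
    return ranked[:k]

single = fake_retrieve("hydrogen vehicles", 2)
merged = multi_query_retrieve("hydrogen vehicles", fake_rephrase, fake_retrieve)
print(single, merged)  # ['d3', 'd1'] ['d3', 'd1', 'd2']
```

The rephrased queries surface `d2`, which the original query alone missed. RAG-Fusion replaces the plain union with a rank-aware merge.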

07 / RETRIEVAL

Types of Retrieval

Different retrieval strategies suit different queries. Dense retrieval excels at semantic similarity; sparse at exact keyword matching. Hybrid combines both for maximum coverage.

🔵 Dense Retrieval

Embed query + documents into the same vector space. Find nearest neighbors by cosine / dot-product similarity. Best for semantic questions.

similarity = q · d / (|q||d|)
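A direct translation of the cosine formula, followed by a brute-force top-k ranking. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(q, d):
    """q · d / (|q||d|), returning 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d))
    return dot / norm if norm else 0.0

def dense_retrieve(query_vec, doc_vecs, k=2):
    """Score every document against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, v), doc_id) for doc_id, v in doc_vecs.items()]
    return sorted(scored, reverse=True)[:k]

# Toy vectors, invented for this example.
doc_vecs = {
    "physics": [0.9, 0.1, 0.0],
    "computing": [0.7, 0.7, 0.1],
    "biology": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

results = dense_retrieve(query_vec, doc_vecs)
print([doc_id for _, doc_id in results])  # ['physics', 'computing']
```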

🟠 Sparse (BM25)

TF-IDF variant that scores keyword overlap. Excels at exact term matching — product names, IDs, proper nouns. Doesn't understand semantics.

BM25(q,d) = Σ IDF(qi) · f(qi,d)·(k1+1) / (f(qi,d) + k1·(1 − b + b·|d|/avgdl))
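A sketch of the common Okapi BM25 scoring function, including term-frequency saturation (k1) and document-length normalization (b). The defaults k1=1.5 and b=0.75 and the smoothed IDF variant are conventional choices assumed here, and tokenization is plain whitespace splitting.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in a non-empty corpus against the query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    df = Counter(term for t in tokenized for term in set(t))  # document frequency
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "the osprey processor has 433 qubits",
    "qubits exploit entanglement",
    "classical bits are zero or one",
]
scores = bm25_scores("osprey qubits", docs)
print([round(s, 3) for s in scores])
```

The first document matches both query terms, including the rare term "osprey", so it outranks the second; the third shares no terms and scores zero.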

🟢 Hybrid Search

Combine dense + sparse scores via Reciprocal Rank Fusion or weighted sum. Best of both worlds — semantics + exact match.

RRF(d) = Σ_i 1/(k + rank_i(d))
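Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch, assuming the commonly used constant k=60 and made-up document ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) over every ranking in which a document appears."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["d2", "d1", "d3"]    # hypothetical dense ranking
sparse = ["d1", "d4", "d2"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['d1', 'd2', 'd4', 'd3']
```

Documents ranked well by both retrievers rise to the top, and because only ranks are used, dense and sparse scores never need to be put on a common scale.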

🟣 Graph RAG

Build a knowledge graph from entities and relations. Traverse graph edges to answer multi-hop questions that span many documents.

entity → relation → entity
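Multi-hop traversal can be sketched as a bounded breadth-first walk over (entity, relation, entity) triples. The triples below are hypothetical, and a real Graph RAG system would first extract them from documents with an LLM or an information-extraction model.

```python
from collections import deque

def multi_hop(triples, start, max_hops=2):
    """Return every relation path of up to max_hops edges that starts at `start`."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
    paths, queue = [], deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue                     # hop budget used up on this branch
        for relation, tail in graph.get(node, []):
            new_path = path + [(node, relation, tail)]
            paths.append(new_path)
            queue.append((tail, new_path))
    return paths

# Hypothetical knowledge-graph triples.
triples = [
    ("Shor's algorithm", "runs_on", "quantum computer"),
    ("quantum computer", "built_from", "qubits"),
    ("qubits", "exhibit", "superposition"),
]
paths = multi_hop(triples, "Shor's algorithm", max_hops=2)
for path in paths:
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in path))
```

The two-hop path links "Shor's algorithm" to "qubits" even though no single triple mentions both, which is the kind of connection a pure vector search tends to miss.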

RETRIEVAL RESULTS — RANKED BY SCORE

0.94

…quantum bits maintain superposition until measured, enabling parallel computation…

0.87

…qubits exploit entanglement to process exponentially more states than classical bits…

0.79

…Shor's algorithm leverages quantum interference to factor large integers efficiently…

0.61

…IBM unveiled its 433-qubit Osprey processor in 2022…

08 / REFINEMENT

Reranking

A cross-encoder model re-scores retrieved chunks against the query jointly, capturing fine-grained relevance that the bi-encoder embedding model missed. Precision over recall.

BEFORE — ANN RETRIEVAL ORDER

Quantum history

Qubit superposition

IBM quantum chip

Shor's algorithm

Quantum gates

🏆 Cross-Encoder

↓↑

Reranks top-K with full attention

AFTER — RERANKED ORDER

Qubit superposition ✓

Shor's algorithm ✓

Quantum gates

IBM quantum chip

Quantum history
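The reranking step is a re-sort of the retrieved candidates by a pairwise (query, chunk) score. In this sketch `overlap_score` is a crude stand-in for a cross-encoder so the code runs without a model; only the `rerank` structure is the point.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Re-score each candidate jointly with the query and keep the top_n."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [text for _, text in scored[:top_n]]

def overlap_score(query, text):
    """Stand-in for a cross-encoder: fraction of query words present in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)

retrieved = [
    "Quantum history",
    "Qubit superposition",
    "IBM quantum chip",
    "Shor's algorithm",
    "Quantum gates",
]
top = rerank("how does qubit superposition work", retrieved, overlap_score)
print(top[0])  # Qubit superposition
```

Because a cross-encoder is run once per candidate, it is usually applied only to the top-K results from the cheaper first-stage retriever.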

09 / FULL PIPELINE

RAG — End-to-End

Retrieval-Augmented Generation unifies all previous steps. The user query flows through decomposition → retrieval → reranking → context injection → LLM generation.

── OFFLINE (INDEXING PHASE) ──────────────────────────

📄 Docs

→

✂️ Chunk

→

🔢 Embed

→

🗄️ Vector DB

── ONLINE (QUERY PHASE) ───────────────────────────────

💬 Query

→

🔀 Decompose

→

🔢 Embed

→

🔍 Retrieve

→

🏆 Rerank

── GENERATION PHASE ───────────────────────────────────

[SYSTEM] You are a helpful assistant. Use only the provided context.
[CONTEXT] Chunk #1: Qubits exploit superposition to represent 0 and 1 simultaneously…
Chunk #2: Shor's algorithm factors integers in polynomial time using quantum gates…
[USER] Explain quantum computing simply.

🧠 LLM

→

✅ Grounded Answer
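The context-injection step amounts to string assembly. A minimal sketch that reproduces the [SYSTEM] / [CONTEXT] / [USER] layout shown above; `build_prompt` and its `max_chunks` parameter are illustrative names, not part of any framework.

```python
def build_prompt(question, chunks, max_chunks=2):
    """Assemble a grounded prompt from the top reranked chunks."""
    context = "\n".join(
        f"Chunk #{i}: {chunk}" for i, chunk in enumerate(chunks[:max_chunks], start=1)
    )
    return (
        "[SYSTEM] You are a helpful assistant. Use only the provided context.\n"
        f"[CONTEXT] {context}\n"
        f"[USER] {question}"
    )

chunks = [
    "Qubits exploit superposition to represent 0 and 1 simultaneously…",
    "Shor's algorithm factors integers in polynomial time using quantum gates…",
    "A lower-ranked chunk that does not fit the context budget.",
]
prompt = build_prompt("Explain quantum computing simply.", chunks)
print(prompt)
```

In practice the cutoff is usually a token budget rather than a fixed chunk count, but the structure is the same.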

10 / AUTONOMY

Single Agent

An LLM given tools and a Perceive → Plan → Act → Reflect loop. It autonomously decides which tool to call, executes it, observes the result, and iterates toward a goal.

🤖

AGENT

🔍

💾

Memory

⚡

Execute

🗺️

Plan

AGENT EXECUTION TRACE

💭 THINK: I need to find the latest GDP data, then analyze trends.

🔧 TOOL CALL: web_search("global GDP growth 2024 IMF")

📥 OBSERVE: Results: IMF projects 3.2% global growth in 2024…

💭 THINK: Now I'll run analysis code on this data.

🔧 TOOL CALL: code_exec("import pandas as pd; df.describe()")

✅ DONE: Goal achieved. Returning analysis to user.
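The think / tool call / observe loop can be sketched in a few lines. Everything here is a stand-in: `scripted_policy` plays the role of the LLM, and the two tools return canned or regex-derived values so the example runs offline.

```python
import re

def run_agent(goal, policy, tools, max_steps=5):
    """Loop: ask the policy for an action, run the tool, feed the result back."""
    trace, observation = [], None
    for _ in range(max_steps):
        action = policy(goal, observation, trace)          # THINK
        if action["type"] == "finish":
            trace.append(("DONE", action["answer"]))
            return action["answer"], trace
        result = tools[action["tool"]](action["input"])    # TOOL CALL
        trace.append((action["tool"], result))             # OBSERVE
        observation = result
    return None, trace  # step budget exhausted without finishing

# Hypothetical tools with canned behavior.
def web_search(query):
    return "IMF projects 3.2% global growth in 2024"

def extract_percent(text):
    match = re.search(r"(\d+(?:\.\d+)?)%", text)
    return match.group(1) if match else ""

def scripted_policy(goal, observation, trace):
    """Stand-in for the LLM: a fixed plan keyed on the last tool used."""
    if not trace:
        return {"type": "tool", "tool": "web_search", "input": goal}
    if trace[-1][0] == "web_search":
        return {"type": "tool", "tool": "extract_percent", "input": observation}
    return {"type": "finish", "answer": f"Projected growth: {observation}%"}

tools = {"web_search": web_search, "extract_percent": extract_percent}
answer, trace = run_agent("global GDP growth 2024 IMF", scripted_policy, tools)
print(answer)  # Projected growth: 3.2%
```

The `max_steps` cap matters in real agents: without it, a model that never emits a finish action loops forever.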

11 / COLLABORATION

Multi-Agent Systems

Specialized agents collaborate under an orchestrator, parallelizing complex tasks. Each agent owns a domain and communicates structured results through a shared protocol.

🎯 Orchestrator Agent — Plans & Delegates

🔍

Researcher

Web search & summarize

💻

Coder

Write & execute code

🔎

Critic

Review & validate

✍️

Writer

Draft final output

🛡️

Safety

Check guardrails

📦 Final Verified Deliverable

orchestrator → researcher: "Find top 3 papers on transformer efficiency"
researcher → orchestrator: [{title, abstract, url} × 3]
orchestrator → coder: "Implement FlashAttention from paper #1"
coder → critic: def flash_attention(q,k,v): ...
critic → orchestrator: "Approved ✓ — edge case on empty context fixed"

12 / PROTOCOL

Model Context Protocol (MCP)

Anthropic's open standard giving AI models a universal, typed interface to tools and data sources. One protocol replaces hundreds of bespoke integrations.

HOST APP

Claude Desktop · VS Code · Custom Agent · Web App

↕

⚡ MCP: JSON-RPC 2.0 over stdio, SSE, or Streamable HTTP

↕

📁

Filesystem

Read/write local files

🌐

Browser

Fetch & scrape web

🗃️

Database

SQL / NoSQL queries

🔧

GitHub

Repos, PRs, issues

📧

Email

Read/send messages

📅

Calendar

Events & scheduling

REQUEST LIFECYCLE — JSON-RPC OVER STDIO

🧠 LLM: selects tool

→

📤 Request: {"method":"tools/call"}

→

🔧 MCP Server: validates + runs

→

📥 Response: {"result":…}

→

💬 LLM: continues reasoning
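The request lifecycle can be illustrated with plain JSON-RPC 2.0 messages. This is a toy sketch, not an MCP SDK: the `read_file` tool is hypothetical, and the `params` / `result` field layout follows my reading of the MCP tools specification and should be checked against the spec before relying on it.

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Client side: build a JSON-RPC 2.0 request for the tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

def handle_request(raw, tools):
    """Toy server side: validate the request, run the tool, wrap the outcome."""
    request = json.loads(raw)
    base = {"jsonrpc": "2.0", "id": request.get("id")}
    if request.get("method") != "tools/call":
        return {**base, "error": {"code": -32601, "message": "Method not found"}}
    params = request.get("params", {})
    tool = tools.get(params.get("name"))
    if tool is None:
        return {**base, "error": {"code": -32602, "message": "Unknown tool"}}
    text = tool(**params.get("arguments", {}))
    return {**base, "result": {"content": [{"type": "text", "text": text}]}}

# Hypothetical tool registry with a canned implementation.
TOOLS = {"read_file": lambda path: f"(contents of {path})"}

request = make_tool_call(1, "read_file", {"path": "notes.txt"})
response = handle_request(json.dumps(request), TOOLS)
print(json.dumps(response))
```

The `id` field is what lets the client match a response to its request when several calls are in flight over the same connection.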

13 / INTEROPERABILITY

Agent-to-Agent Protocol (A2A)

Google's open protocol enabling agents on different frameworks to discover each other, delegate tasks, and exchange typed messages — the HTTP of the multi-agent web.

🤖

Agent Alpha

Research & Planning

Claude / LangChain

A2A CHANNEL

→ task_request

← task_result

→ stream_chunk

← artifact

JSON / HTTP / SSE

🦾

Agent Beta

Code & Execution

GPT-4 / AutoGen

🔍 Agent Discovery

Agents publish capabilities at /.well-known/agent.json — auto-discoverable by peers.

📋 Task Delegation

Structured Task objects with typed inputs, outputs, status, and cancellation support.

📡 Streaming

Server-Sent Events for long-running tasks — progress updates flow continuously.

🔐 Auth & Security

OAuth 2.0 and API key support — enterprise-grade access control between agents.

🌐 Framework Agnostic

Claude, GPT, Gemini, LangChain, AutoGen — all interoperate via the same protocol.

📦 Rich Artifacts

Text, files, structured JSON, images — all supported as typed task output artifacts.

14 / MEASUREMENT

Evaluation

Systematic measurement of LLM and RAG system quality. Without rigorous evals, you're flying blind — you can't improve what you can't measure.

Faithfulness: 0.91

Checks that the answer contains only claims supported by the retrieved context. Low scores signal hallucination.

Answer Relevancy: 0.87

Does the answer actually address the user's question?

Context Precision: 0.78

Fraction of retrieved chunks that were actually relevant (signal-to-noise).

Context Recall: 0.83

Did retrieval find all necessary information to answer the question?
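The two context metrics reduce to set arithmetic once relevance labels exist. This sketch assumes hand-labeled ground truth and made-up chunk ids; RAGAS itself derives relevance with LLM judgments and weights precision by rank, so treat this as the simplified idea rather than the library's computation.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Hypothetical chunk ids and labels.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c9"}

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(precision, round(recall, 2))  # 0.5 0.67
```

Retrieving more chunks tends to raise recall and lower precision, which is why the two are tracked together.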

Automated (RAGAS)

Faithfulness score
Answer relevancy
Context precision/recall
BLEU / ROUGE

LLM-as-Judge

GPT-4 grades outputs
Rubric-based scoring
Pairwise comparison
Constitutional AI

Human Eval

Expert annotation
A/B user testing
Likert scale rating
Ground truth labels

Behavioral

Adversarial prompts
Robustness testing
Latency benchmarks
Cost per query

15 / SAFETY

Guardrails

Input and output filters that detect and block harmful, off-topic, or policy-violating content. Guardrails sit as a protective layer around every LLM call.

DATA FLOW WITH GUARDRAILS

💬 User Input

→

🛡️ Input Guard

→

🧠 LLM

→

🛡️ Output Guard

→

✅ Safe Response

✅ On-topic

✅ No PII

🚫 Prompt injection

✅ Compliant

🚫 PII detected

✅ Non-toxic

🎯 Topic Filter

Detects off-topic requests outside the system's intended domain.

competitor mention · off-topic

🔒 PII Detection

Identifies and redacts SSNs, credit card numbers, emails, and phone numbers from inputs and outputs.

SSN · CC number · email

💉 Prompt Injection

Detects attempts to override the system prompt or hijack agent behavior.

ignore instructions · DAN

☠️ Toxicity / Harm

NLI classifiers detect hate speech, violence, self-harm, CBRN content.

hate speech · violence

✅ Factual Grounding

Validates output against retrieved context and flags unsupported claims (hallucination detection).

hallucination · citation

⚖️ Compliance

Industry-specific rules: HIPAA, GDPR, SOX, financial advice disclaimers.

HIPAA · GDPR · SOX
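As one concrete guard, PII redaction can be sketched with regular expressions. The patterns below are simplified assumptions (US-style SSNs, 13 to 16 digit card numbers, basic email shapes) and will miss many real-world formats; production systems typically combine patterns with checksum validation and trained entity recognizers.

```python
import re

# Simplified, illustrative patterns; not exhaustive.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def redact(text):
    """Replace each detected PII span with a [LABEL] tag and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found

message = "Email jane.doe@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
clean, found = redact(message)
print(clean)  # Email [EMAIL], SSN [SSN], card [CREDIT_CARD].
```

The same function can run on both sides of the model: as an input guard before the prompt is sent, and as an output guard before the response reaches the user.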

16 / MONITORING

Observability

Full-stack visibility into your AI system — traces, metrics, logs, and alerts. You can't optimize what you can't see. Observability closes the loop between deployment and improvement.

LIVE DASHBOARD

Uptime: 98.7%

P50 Latency: 1.2s

P99 Latency: 4.1s

Avg RAGAS: 0.89

Error Rate: 0.3%

Cost/Query: $0.04

QUERY VOLUME — 24H

DISTRIBUTED TRACE — SINGLE REQUEST

total request: 3840ms

query embed: 120ms

ANN search: 80ms

rerank: 210ms

LLM generate: 2850ms

guardrails: 180ms

📡 Traces

End-to-end request tracing across retrieval, LLM calls, tool use, and agent steps. OpenTelemetry compatible.

📊 Metrics

Latency P50/P95/P99, token usage, cost per query, retrieval hit rate, hallucination rate.

📝 Logs

Structured logs of prompts, completions, retrieved chunks, tool calls, and guardrail decisions.

🚨 Alerts

Real-time alerts on latency spikes, error rate increases, quality degradation, and cost anomalies.
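Latency percentiles such as P50 and P99 can be computed from raw request timings. This sketch uses the nearest-rank definition, one of several in common use, and the sample latencies and the 4000 ms alert threshold are invented for illustration.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [900, 1100, 1200, 1250, 1300, 1500, 1800, 2400, 3100, 4100]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
alert = p99 > 4000  # assumed alert threshold
print(p50, p99, alert)  # 1300 4100 True
```

Tail percentiles matter more than averages for user experience: here the median is 1.3 s, but the slowest requests take more than three times as long.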

Full Pipeline Summary

Concept | Role in Pipeline | Key Mechanism | Key Tool / Metric
🧠 LLM | Foundation | Transformer · next-token prediction | GPT-4o, Claude, Llama
⚙️ Fine-Tuning | Specialization | LoRA / QLoRA / RLHF on domain data | Loss curve, eval accuracy
✂️ Chunking | Ingestion | Fixed / Semantic / Hierarchical split | Chunk size, overlap
🔢 Embeddings | Representation | Dense vectors · cosine similarity | MTEB benchmark
🗄️ Vector DB | Storage & Search | HNSW / IVF ANN index | Pinecone, Qdrant, Weaviate
🔀 Query Decomp | Query Intelligence | Sub-queries / HyDE / RAG-Fusion | Recall improvement %
🔍 Retrieval | Finding Context | Dense / Sparse / Hybrid / Graph | MRR, NDCG, Recall@K
🏆 Reranking | Refinement | Cross-encoder re-scores top-K | Cohere Rerank, Jina
📚 RAG | Full Pipeline | Retrieve → Augment → Generate | RAGAS score
🤖 Single Agent | Autonomy | Perceive → Plan → Act loop | Task completion rate
🌐 Multi-Agent | Collaboration | Orchestrator + specialist agents | Parallelism, throughput
⚡ MCP | Tool Protocol | JSON-RPC 2.0 typed tool calls | Tool call latency
🔗 A2A | Agent Interop | HTTP task delegation protocol | Cross-framework compat
📊 Evaluation | Measurement | RAGAS / LLM-judge / Human eval | Faithfulness, relevancy
🛡️ Guardrails | Safety | Input/output NLI classifiers | Block rate, false positive
👁️ Observability | Monitoring | Traces · metrics · logs · alerts | Latency, cost, error rate

From Tokens to Agents at Scale
