AI Concepts — Frontier Volume III

01 / CONTEXT ENGINEERING

What Fits in the Window

The context window is the single most precious resource in an LLM system. Context engineering is the discipline of deciding what to put in, what to compress, what to cache, and what to leave out — because everything you add costs latency, money, and attention.

CONTEXT WINDOW · 128K TOKENS used: 104,200 / 131,072 · 79%

SYS

CONVERSATION HISTORY

RETRIEVED DOCS (RAG)

⚠ DEAD ZONE

USER QUERY

System prompt · 2K

History · 18K

RAG chunks · 62K

User · 8K

Lost in the middle

⚠ ISSUE Information placed in the middle of long contexts is recalled significantly worse than info at the start or end — the U-shaped recall curve.

The U-Shaped Recall Curve

RECALL ACCURACY vs. POSITION IN CONTEXT

📦

Sliding Window

Keep only the last N turns. Drop oldest messages as new ones arrive. Simple, predictable, but loses long-term context.

window = msgs[-N:]

🗜️

Summarization

Periodically compress old messages into a running summary. Preserves narrative continuity at the cost of fidelity.

summary = LLM(old_msgs)

🎯

Selective Retrieval

Embed past turns; only re-inject the ones semantically relevant to the new query. Memory becomes searchable, not sequential.

retrieve(history, query)

🔼

Reorder for Position Bias

Place the most critical facts at the top OR bottom of the prompt — never the middle, where attention degrades.

[critical, ...rest, query]

🪜

Hierarchical Compression

Recent turns: full text. Older turns: summaries. Oldest: just topic tags. Detail decays gracefully with age.

recent → summary → tag

🚫

Aggressive Pruning

Strip whitespace, drop low-information chunks, deduplicate, remove tool-call boilerplate before sending to the model.

prune(ctx, threshold)

PROMPT CACHING — KV-CACHE REUSE

📌 Cached Prefix system + tools + few-shot · 12K tokens · 1¢ read

CACHE HIT→

✏️ New Tokens fresh user message · 200 tokens · 8¢ write

→

⚡ Response generated · 90% latency reduction

10×

CHEAPER READS

85%

LATENCY DROP

5min

DEFAULT TTL

1024+

MIN TOKENS

02 / HALLUCINATION & GROUNDING

Why Models Lie
(and How to Stop Them)

LLMs don't have a mechanism for distinguishing what they know from what they're plausibly generating — every answer is a fluent guess. Grounding ties generation to verifiable sources, so the model is forced to reference reality instead of inventing it.

📚

Factual Hallucination

Fabricated facts that contradict reality — wrong dates, invented names, fictional citations, made-up statistics.

Q: When did Einstein win the Nobel?
A: "In 1922 for relativity" (it was 1921, for the photoelectric effect)

🔗

Faithfulness Hallucination

Output contradicts or extends beyond the source documents provided in context — most common failure mode in RAG systems.

Source says: "Revenue grew."
Model says: "Revenue grew 23%." (number invented)

🧩

Logical Hallucination

Internally inconsistent reasoning — the model contradicts a claim it just made, or its conclusion doesn't follow from premises.

Step 1: "All birds fly."
Step 3: "Penguins, which are birds, cannot fly." (premise was wrong)

📐

Schema Hallucination

Tool calls or structured outputs that violate the expected format — invented field names, wrong types, malformed JSON.

Schema: { "user_id": int }
Output: { "userId": "alice" }

⚠ UNGROUNDED RESPONSE

Who founded the company in 2018 and what was its first product?

Acme Corp was founded in March 2018 by Sarah Chen and David Park, two former Google engineers. Their first product was a developer-facing API for image classification that launched in late 2018 and reached 10K users within six months.

⚠ NO SOURCES PROVIDED

model generated plausible details from priors

✓ GROUNDED RESPONSE

Who founded the company in 2018 and what was its first product?

The company was founded by Maya Patel¹ in October 2018¹. Their first product was a real-time analytics dashboard² released in early 2019². The retrieved sources do not mention co-founders or user metrics.

[1] about-us.html · "Founder Maya Patel established..."

[2] press-2019-q1.pdf · p.3 · "launched our flagship..."

Grounding Metrics — How Faithful Is the Output?

HIGHER = MORE FAITHFUL TO SOURCES

Faithfulness

claims supported by retrieved context

88%

Answer Relevance

does the answer address the question

94%

Context Precision

retrieved chunks ranked by relevance

71%

Context Recall

ground truth claims found in context

82%

Citation Accuracy

cited source actually supports claim

76%

HALLUCINATION MITIGATION PIPELINE

🔍 Retrievefind evidence first

→

📎 Citeforce inline references

→

🤐 Abstainsay "I don't know"

→

⚖️ Verifysecond-pass NLI check

→

🎯 Scoreconfidence calibration

03 / TEST-TIME COMPUTE

Thinking Before Answering

For years, LLM capability scaled with training compute. Reasoning models flipped that — they spend more compute at inference time, generating long internal chains of thought, exploring branches, and verifying their own work before producing the final answer.

Pre-2024: Train Big, Answer Fast

Capability is baked in during pretraining. Inference is a one-shot forward pass. More tokens at inference doesn't help — the model already "knows" or it doesn't.

capability ∝ training_FLOPs

2024+: Train Less, Think More

Models trained to reason can substantially improve answer quality by spending more tokens thinking. Capability is now a knob you turn at inference time.

capability ∝ train + log(test_FLOPs)

REASONING TRACE — A QUEEN'S GAMBIT PROBLEM

thinking tokens: 2,847 · output tokens: 38

[THINK]The user is asking which knight move is best after 1.d4 d5 2.c4. Let me consider candidate moves: Nc3 develops naturally...

[THINK]Wait, but Nc3 blocks the c-pawn. In the Queen's Gambit, white usually wants c4 to stay flexible. Let me check Nf3 instead.

[BACKTRACK]Actually I was confused — c4 is already played, so blocking the c-pawn isn't relevant anymore. Reconsidering Nc3.

[THINK]If black plays dxc4, white can recapture with e3 then Bxc4. This gives white an open game with central pressure.

[VERIFY]Cross-checking with opening theory: yes, Nc3 is the main line of the Queen's Gambit Declined. My analysis is consistent.

[ANSWER]The standard move is 3.Nc3, developing the knight while supporting a future e4 push. Black typically responds with Nf6 or e6.

The New Scaling Law

ACCURACY vs. INFERENCE COMPUTE

STRATEGY 01

Chain-of-Thought (CoT)

Prompt the model to reason step-by-step before answering. Free, simple, surprisingly effective on math and logic tasks. The original test-time technique.

+ tokens + accuracy

STRATEGY 02

Self-Consistency

Generate N different reasoning paths at non-zero temperature, then vote on the most common answer. Trades linear cost for substantial accuracy gains on hard problems.

N× cost + robustness

STRATEGY 03

Tree of Thoughts (ToT)

Explore multiple branches of reasoning, evaluate intermediate states, and prune dead ends. Search through reasoning space instead of committing to one chain.

complex orchestration + exploration

STRATEGY 04

Reasoning Models (o1, R1)

Models RL-trained to produce extended thinking traces natively. The model decides on its own when to think more, backtrack, verify, or commit to an answer.

expensive · slow SOTA on reasoning

STRATEGY 05

Best-of-N + Verifier

Generate N candidate answers, score each with a learned reward model or verifier, return the highest-scoring one. Effective when verification is easier than generation.

N+1 calls + quality floor

STRATEGY 06

Process Reward Models

Score not just the final answer but each intermediate reasoning step. Allows beam-search-style exploration where the model commits to good steps and abandons bad ones.

requires PRM step-level signal

Three Concepts at a Glance

The frontier of production LLM systems — what to put in, what to trust, and how hard to think.

CONCEPT

PROBLEM IT SOLVES

KEY TECHNIQUES

WHEN IT MATTERS

📐 Context Engineering

Limited window, position bias, cost of long prompts

Sliding window · summarization · selective retrieval · prompt caching · KV-cache reuse

Long conversations, RAG-heavy apps, cost-sensitive deployments

🪞 Hallucination & Grounding

Models confidently fabricate plausible-sounding falsehoods

Forced citations · NLI verification · abstention · faithfulness scoring · grounded decoding

High-stakes domains: legal, medical, finance, anywhere truth matters

🧮 Test-Time Compute

One-shot generation isn't enough for hard reasoning

CoT · self-consistency · Tree of Thoughts · reasoning models · Best-of-N + verifier

Math, code, multi-step planning, anywhere a wrong answer is worse than a slow one

Context, Truth
& Reasoning

Three Pillars of Modern LLMs

What Fits in the Window

The U-Shaped Recall Curve

Why Models Lie
(and How to Stop Them)

Factual Hallucination

Faithfulness Hallucination

Logical Hallucination

Schema Hallucination

Grounding Metrics — How Faithful Is the Output?

Thinking Before Answering

Pre-2024: Train Big, Answer Fast

2024+: Train Less, Think More

REASONING TRACE — A QUEEN'S GAMBIT PROBLEM

The New Scaling Law

Chain-of-Thought (CoT)

Self-Consistency

Tree of Thoughts (ToT)

Reasoning Models (o1, R1)

Best-of-N + Verifier

Process Reward Models

Three Concepts at a Glance

Context, Truth& Reasoning

Three Pillars of Modern LLMs

What Fits in the Window

The U-Shaped Recall Curve

Why Models Lie(and How to Stop Them)

Factual Hallucination

Faithfulness Hallucination

Logical Hallucination

Schema Hallucination

Grounding Metrics — How Faithful Is the Output?

Thinking Before Answering

Pre-2024: Train Big, Answer Fast

2024+: Train Less, Think More

REASONING TRACE — A QUEEN'S GAMBIT PROBLEM

The New Scaling Law

Chain-of-Thought (CoT)

Self-Consistency

Tree of Thoughts (ToT)

Reasoning Models (o1, R1)

Best-of-N + Verifier

Process Reward Models

Three Concepts at a Glance

Context, Truth
& Reasoning

Why Models Lie
(and How to Stop Them)