FRONTIER AI CONCEPTS — VOLUME III

Context, Truth
& Reasoning

Three frontier topics that define modern LLM systems — what fits in the window, why models lie, and how compute at inference time changed everything.

📐 Context Engineering
🪞 Hallucination & Grounding
🧮 Test-Time Compute

Three Pillars of Modern LLMs

Each concept addresses a different layer — what the model sees, what it says, and how hard it thinks before answering.

01 / CONTEXT ENGINEERING

What Fits in the Window

The context window is the single most precious resource in an LLM system. Context engineering is the discipline of deciding what to put in, what to compress, what to cache, and what to leave out — because everything you add costs latency, money, and attention.

CONTEXT WINDOW · 128K TOKENS · used: 104,200 / 131,072 · 79%
Layout: SYS · CONVERSATION HISTORY · RETRIEVED DOCS (RAG) · ⚠ DEAD ZONE (lost in the middle) · USER QUERY
System prompt · 2K
History · 18K
RAG chunks · 62K
User · 8K
⚠ ISSUE: Information placed in the middle of a long context is recalled significantly worse than information at the start or end, which produces the U-shaped recall curve.

The U-Shaped Recall Curve

RECALL ACCURACY vs. POSITION IN CONTEXT
START ~95% · MIDDLE ~52% · END ~92%
📦
Sliding Window
Keep only the last N turns. Drop oldest messages as new ones arrive. Simple, predictable, but loses long-term context.
window = msgs[-N:]
🗜️
Summarization
Periodically compress old messages into a running summary. Preserves narrative continuity at the cost of fidelity.
summary = LLM(old_msgs)
🎯
Selective Retrieval
Embed past turns; only re-inject the ones semantically relevant to the new query. Memory becomes searchable, not sequential.
retrieve(history, query)
🔼
Reorder for Position Bias
Place the most critical facts at the top OR bottom of the prompt — never the middle, where attention degrades.
[critical, ...rest, query]
🪜
Hierarchical Compression
Recent turns: full text. Older turns: summaries. Oldest: just topic tags. Detail decays gracefully with age.
recent → summary → tag
🚫
Aggressive Pruning
Strip whitespace, drop low-information chunks, deduplicate, remove tool-call boilerplate before sending to the model.
prune(ctx, threshold)
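Several of the strategies above can be combined in a single assembly step. The following is a minimal sketch, not a reference implementation: `estimate_tokens`, `sliding_window`, `prune`, and `build_context` are hypothetical helpers introduced here, and the 4-characters-per-token estimate is an assumption (a real system would count with the model's own tokenizer).

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def sliding_window(msgs: List[str], n: int) -> List[str]:
    """Keep only the last n turns, i.e. window = msgs[-N:]."""
    return msgs[-n:] if n > 0 else []


def prune(chunks: List[str], min_chars: int = 20) -> List[str]:
    """Collapse whitespace, drop very short chunks, and deduplicate (order preserved)."""
    seen = set()
    kept = []
    for chunk in chunks:
        cleaned = " ".join(chunk.split())
        if len(cleaned) < min_chars or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


def build_context(critical: List[str], history: List[str], chunks: List[str],
                  query: str, budget: int, n_turns: int = 4) -> List[str]:
    """Assemble [critical, ...rest, query], skipping any 'rest' item that would exceed the budget."""
    rest = prune(chunks) + sliding_window(history, n_turns)
    used = sum(estimate_tokens(t) for t in critical) + estimate_tokens(query)
    middle = []
    for item in rest:
        cost = estimate_tokens(item)
        if used + cost > budget:
            continue  # this item does not fit; a smaller later item still might
        middle.append(item)
        used += cost
    return critical + middle + [query]
```

The critical facts and the query are always kept and sit at the two ends of the prompt, so only the middle material competes for the remaining budget. Pruned chunks are considered before history here, which is an arbitrary choice for the sketch.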
PROMPT CACHING — KV-CACHE REUSE
📌 Cached Prefix · system + tools + few-shot · 12K tokens · 1¢ read
CACHE HIT
✏️ New Tokens · fresh user message · 200 tokens · 8¢ write
⚡ Response generated · up to 85% latency reduction
10× CHEAPER READS
85% LATENCY DROP
5min DEFAULT TTL
1024+ MIN TOKENS
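The arithmetic behind the "10× cheaper reads" figure can be sketched as follows. Everything here is illustrative: `input_cost_usd` and `cacheable` are hypothetical helpers, the base price of $3.00 per million input tokens and the 1.25× cache-write surcharge are assumptions, and only the 0.1× read multiplier and the 1024-token minimum come from the figures above.

```python
def input_cost_usd(prefix_tokens: int, new_tokens: int, cache_hit: bool,
                   base_per_mtok: float = 3.00,     # assumed base input price, USD per 1M tokens
                   read_multiplier: float = 0.10,   # from the "10x cheaper reads" figure
                   write_multiplier: float = 1.25,  # assumed surcharge for writing the cache
                   ) -> float:
    """Input-side cost of one request whose prompt starts with a cacheable prefix."""
    prefix_rate = read_multiplier if cache_hit else write_multiplier
    prefix_cost = prefix_tokens * base_per_mtok * prefix_rate / 1_000_000
    new_cost = new_tokens * base_per_mtok / 1_000_000
    return prefix_cost + new_cost


def cacheable(prefix_tokens: int, min_tokens: int = 1024) -> bool:
    """Prefixes shorter than the minimum length are not cached."""
    return prefix_tokens >= min_tokens


miss = input_cost_usd(12_000, 200, cache_hit=False)
hit = input_cost_usd(12_000, 200, cache_hit=True)
print(f"first request (cache write): ${miss:.4f}")
print(f"later request (cache hit):   ${hit:.4f}")
```

Under these assumed prices, the first request pays slightly more than an uncached one because of the write surcharge, and every later request that arrives within the TTL pays roughly a tenth as much for the 12K-token prefix.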
02 / HALLUCINATION & GROUNDING

Why Models Lie
(and How to Stop Them)

LLMs have no reliable built-in mechanism for distinguishing what they know from what they are plausibly generating, so every answer is, in effect, a fluent guess. Grounding ties generation to verifiable sources: the model is pushed to cite retrieved evidence, or to abstain, instead of inventing details.

📚

Factual Hallucination

Fabricated facts that contradict reality — wrong dates, invented names, fictional citations, made-up statistics.

Q: When did Einstein win the Nobel?
A: "In 1922 for relativity" (it was the 1921 prize, announced in 1922, and it was awarded for the photoelectric effect, not relativity)
🔗

Faithfulness Hallucination

Output contradicts or extends beyond the source documents provided in context — most common failure mode in RAG systems.

Source says: "Revenue grew."
Model says: "Revenue grew 23%." (number invented)
🧩

Logical Hallucination

Internally inconsistent reasoning — the model contradicts a claim it just made, or its conclusion doesn't follow from premises.

Step 1: "All birds fly."
Step 3: "Penguins, which are birds, cannot fly." (contradicts Step 1, and the trace never revisits the false premise)
📐

Schema Hallucination

Tool calls or structured outputs that violate the expected format — invented field names, wrong types, malformed JSON.

Schema: { "user_id": int }
Output: { "userId": "alice" }
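Schema hallucinations are the easiest of the four types to catch mechanically, because the expected format is known in advance. The following is a minimal sketch using only the standard library; `SCHEMA` and `validate` are hypothetical names introduced here, and a production system would more likely use a full schema-validation library.

```python
import json
from typing import Any, Dict, List

# Expected schema from the example above: field name -> required Python type.
SCHEMA: Dict[str, type] = {"user_id": int}


def validate(raw: str, schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the output matches the schema."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    for field in data:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


print(validate('{"userId": "alice"}', SCHEMA))
print(validate('{"user_id": 42}', SCHEMA))
```

The first call reports both the missing `user_id` field and the invented `userId` field; the second returns an empty list. A failed validation can be fed back to the model as a retry prompt rather than passed downstream.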
⚠ UNGROUNDED RESPONSE
Who founded the company in 2018 and what was its first product?
Acme Corp was founded in March 2018 by Sarah Chen and David Park, two former Google engineers. Their first product was a developer-facing API for image classification that launched in late 2018 and reached 10K users within six months.
⚠ NO SOURCES PROVIDED
model generated plausible details from priors
✓ GROUNDED RESPONSE
Who founded the company in 2018 and what was its first product?
The company was founded by Maya Patel [1] in October 2018 [1]. Their first product was a real-time analytics dashboard [2] released in early 2019 [2]. The retrieved sources do not mention co-founders or user metrics.
[1] about-us.html · "Founder Maya Patel established..."
[2] press-2019-q1.pdf · p.3 · "launched our flagship..."

Grounding Metrics — How Faithful Is the Output?

HIGHER = MORE FAITHFUL TO SOURCES
Faithfulness · claims supported by retrieved context · 88%
Answer Relevance · does the answer address the question · 94%
Context Precision · retrieved chunks ranked by relevance · 71%
Context Recall · ground truth claims found in context · 82%
Citation Accuracy · cited source actually supports claim · 76%
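Three of these metrics reduce to simple ratios once each claim or chunk has been labeled. The sketch below shows one common way to define them, not the definition used by any particular evaluation library. The function names are hypothetical, and the boolean labels are assumed to come from a human annotator or a judge model.

```python
from typing import List


def ratio(flags: List[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def faithfulness(claim_supported: List[bool]) -> float:
    """Share of claims in the answer that the retrieved context supports."""
    return ratio(claim_supported)


def context_recall(truth_claim_in_context: List[bool]) -> float:
    """Share of ground-truth claims that can be found in the retrieved context."""
    return ratio(truth_claim_in_context)


def context_precision(chunk_relevant: List[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk (rank-aware)."""
    hits = 0
    precisions = []
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Same two relevant chunks, ranked differently: the better ranking scores higher.
print(context_precision([True, True, False]))
print(context_precision([False, True, True]))
```

Because context precision is rank-aware in this formulation, it penalizes a retriever that finds the right chunks but buries them below irrelevant ones.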
HALLUCINATION MITIGATION PIPELINE
🔍 Retrieve · find evidence first
📎 Cite · force inline references
🤐 Abstain · say "I don't know"
⚖️ Verify · second-pass NLI check
🎯 Score · confidence calibration
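The pipeline stages can be sketched end to end with toy components. This is a deliberately crude illustration: word overlap stands in for both the retriever and the NLI verifier, `draft` stands in for the model's generated answer (no model is called), and every function name and threshold is an assumption introduced here.

```python
import re
from typing import Dict, List, Tuple


def tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for embeddings or an NLI model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Fraction of a's words that also occur in b."""
    ta = tokens(a)
    return len(ta & tokens(b)) / len(ta) if ta else 0.0


def retrieve(query: str, docs: Dict[str, str], min_score: float = 0.3) -> List[Tuple[str, str]]:
    """Return (doc_id, text) pairs whose overlap with the query clears min_score, best first."""
    scored = [(overlap(query, text), doc_id, text) for doc_id, text in docs.items()]
    return [(d, t) for s, d, t in sorted(scored, reverse=True) if s >= min_score]


def answer(query: str, docs: Dict[str, str], draft: str, min_support: float = 0.6) -> str:
    """Abstain without evidence; otherwise cite the best source if it supports the draft."""
    evidence = retrieve(query, docs)
    if not evidence:
        return "I don't know: no supporting source was retrieved."
    doc_id, text = evidence[0]
    support = overlap(draft, text)  # verify step: how much of the draft the source covers
    if support < min_support:
        return "I don't know: the retrieved source does not support the draft answer."
    return f"{draft} [{doc_id}] (support={support:.2f})"


docs = {"about-us.html": "Founder Maya Patel established the company in October 2018."}
query = "Who founded the company in 2018?"
print(answer(query, docs, "Maya Patel established the company in October 2018."))
print(answer(query, docs, "Sarah Chen and David Park founded the company."))
```

The first draft is returned with a citation and a support score; the second is replaced by an abstention because the source does not back it. The control flow is the point of the sketch: evidence is required before answering, and an unsupported draft is never returned.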
03 / TEST-TIME COMPUTE

Thinking Before Answering

For years, LLM capability scaled with training compute. Reasoning models flipped that — they spend more compute at inference time, generating long internal chains of thought, exploring branches, and verifying their own work before producing the final answer.

Pre-2024: Train Big, Answer Fast

Capability is baked in during pretraining. Inference is one-shot: the model answers directly, with no dedicated reasoning budget. The working assumption is that more tokens at inference don't help; the model already "knows" or it doesn't.

capability ∝ training_FLOPs

2024+: Train Less, Think More

Models trained to reason can substantially improve answer quality by spending more tokens thinking. Capability is now a knob you turn at inference time.

capability ∝ training_FLOPs + log(test_FLOPs)

REASONING TRACE — A QUEEN'S GAMBIT PROBLEM

thinking tokens: 2,847 · output tokens: 38
[THINK]The user is asking which knight move is best after 1.d4 d5 2.c4 e6. Let me consider candidate moves: Nc3 develops naturally...
[THINK]Wait, but Nc3 blocks the c-pawn. In the Queen's Gambit, white usually wants c4 to stay flexible. Let me check Nf3 instead.
[BACKTRACK]Actually I was confused: c4 is already played, so blocking the c-pawn isn't relevant anymore. Reconsidering Nc3.
[THINK]If black plays dxc4, white can recapture with e3 then Bxc4. This gives white an open game with central pressure.
[VERIFY]Cross-checking with opening theory: yes, Nc3 is the main line of the Queen's Gambit Declined. My analysis is consistent.
[ANSWER]The standard move is 3.Nc3, developing the knight while supporting a future e4 push. Black typically responds with Nf6 or Be7.

The New Scaling Law

ACCURACY vs. INFERENCE COMPUTE
[Plot: accuracy (low to high) against inference compute (100 tok, 1K tok, 10K tok, 100K tok; low to high compute), with a flat "train-time only" baseline for comparison.]
STRATEGY 01

Chain-of-Thought (CoT)

Prompt the model to reason step-by-step before answering. It needs no extra training, is simple to apply, and is surprisingly effective on math and logic tasks; the only cost is the extra tokens. The original test-time technique.

Cost: + tokens · Gain: + accuracy
STRATEGY 02

Self-Consistency

Generate N different reasoning paths at non-zero temperature, then vote on the most common answer. Trades linear cost for substantial accuracy gains on hard problems.

Cost: N× · Gain: + robustness
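The voting step is small enough to show in full. The following is a minimal sketch: `self_consistency` takes any zero-argument sampler, and `noisy_solver` is a hypothetical stand-in for one sampled reasoning path (a real sampler would call a model at non-zero temperature and extract its final answer).

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(sample: Callable[[], str], n: int) -> Tuple[str, float]:
    """Draw n answers (n >= 1) and return (majority answer, share of votes it received)."""
    answers: List[str] = [sample() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


def noisy_solver(rng: random.Random) -> str:
    """Stand-in for one reasoning path: right 60% of the time, otherwise a scattered wrong answer."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])


rng = random.Random(0)
answer, agreement = self_consistency(lambda: noisy_solver(rng), n=25)
print(answer, agreement)
```

Voting helps when wrong answers scatter across many values while correct ones agree. The agreement share doubles as a rough confidence signal: a narrow majority suggests the question deserves more samples or a fallback.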
STRATEGY 03

Tree of Thoughts (ToT)

Explore multiple branches of reasoning, evaluate intermediate states, and prune dead ends. Search through reasoning space instead of committing to one chain.

Cost: complex orchestration · Gain: + exploration
STRATEGY 04

Reasoning Models (o1, R1)

Models RL-trained to produce extended thinking traces natively. The model decides on its own when to think more, backtrack, verify, or commit to an answer.

Cost: expensive, slow · Gain: SOTA on reasoning
STRATEGY 05

Best-of-N + Verifier

Generate N candidate answers, score each with a learned reward model or verifier, return the highest-scoring one. Effective when verification is easier than generation.

Cost: N+1 calls · Gain: + quality floor
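A toy version makes the "verification is easier than generation" point concrete. In this sketch, `best_of_n` is generic, while `propose` and `verify` are hypothetical stand-ins: the task is to find x with x * x == 1369, where checking a guess is trivial even if producing it is not. A real system would sample candidates from a model and score them with a learned reward model or an executable check.

```python
from typing import Callable, List, Tuple


def best_of_n(generate: Callable[[int], str], score: Callable[[str], float],
              n: int) -> Tuple[str, float]:
    """Generate n candidates (n >= 1) and return the one the verifier scores highest."""
    candidates: List[str] = [generate(i) for i in range(n)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def propose(i: int) -> str:
    """Stand-in generator: the i-th guess, in place of a sampled model answer."""
    return str(31 + i * 2)  # 31, 33, 35, 37, 39, ...


def verify(candidate: str) -> float:
    """Exact check for the toy task: 0 for a correct square root, more negative when further off."""
    return -abs(int(candidate) ** 2 - 1369)


print(best_of_n(propose, verify, 5))
```

With five candidates the verifier picks out "37", the only guess whose square is exactly 1369. The selected answer can never be better than the best candidate generated, so the gain depends on both sample diversity and verifier accuracy.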
STRATEGY 06

Process Reward Models

Score not just the final answer but each intermediate reasoning step. Allows beam-search-style exploration where the model commits to good steps and abandons bad ones.

Cost: requires PRM · Gain: step-level signal
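Step-level scoring is what makes beam search over reasoning chains possible. The sketch below is a toy, with loud assumptions: `beam_search` is generic, but `expand`, `value`, and `step_score` are hypothetical stand-ins, and the "process reward" is simply the distance to a known target number. A trained PRM would instead estimate how promising each partial chain is without knowing the answer.

```python
from typing import Callable, List, Tuple

Chain = Tuple[str, ...]  # a partial reasoning chain, one string per step


def beam_search(expand: Callable[[Chain], List[str]],
                step_score: Callable[[Chain], float],
                depth: int, beam_width: int) -> Chain:
    """Keep the beam_width best partial chains at each depth, ranked by the step-level score."""
    beam: List[Chain] = [()]
    for _ in range(depth):
        candidates = [chain + (step,) for chain in beam for step in expand(chain)]
        if not candidates:
            break
        candidates.sort(key=step_score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


# Toy task: start at 1 and reach 14 in three steps, each step being "+3" or "*2".
def value(chain: Chain, start: int = 1) -> int:
    """Apply the chain's steps to the start value."""
    x = start
    for step in chain:
        x = x + 3 if step == "+3" else x * 2
    return x


def expand(chain: Chain) -> List[str]:
    """Every chain can be extended by either operation."""
    return ["+3", "*2"]


def step_score(chain: Chain) -> float:
    """Stand-in for a PRM: partial chains closer to the target score higher."""
    return -abs(14 - value(chain))


print(beam_search(expand, step_score, depth=3, beam_width=2))
print(beam_search(expand, step_score, depth=3, beam_width=1))
```

With a beam of two, the search keeps a chain that looks slightly worse after two steps and finishes at exactly 14. With a beam of one it commits greedily and ends at 16, which is the trade-off between beam width and cost in miniature.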

Three Concepts at a Glance

The frontier of production LLM systems — what to put in, what to trust, and how hard to think.

📐 Context Engineering
PROBLEM IT SOLVES: Limited window, position bias, cost of long prompts
KEY TECHNIQUES: Sliding window · summarization · selective retrieval · prompt caching · KV-cache reuse
WHEN IT MATTERS: Long conversations, RAG-heavy apps, cost-sensitive deployments

🪞 Hallucination & Grounding
PROBLEM IT SOLVES: Models confidently fabricate plausible-sounding falsehoods
KEY TECHNIQUES: Forced citations · NLI verification · abstention · faithfulness scoring · grounded decoding
WHEN IT MATTERS: High-stakes domains: legal, medical, finance, anywhere truth matters

🧮 Test-Time Compute
PROBLEM IT SOLVES: One-shot generation isn't enough for hard reasoning
KEY TECHNIQUES: CoT · self-consistency · Tree of Thoughts · reasoning models · Best-of-N + verifier
WHEN IT MATTERS: Math, code, multi-step planning, anywhere a wrong answer is worse than a slow one