Skip to content

Context Engineering Report — selectools

Applied using the nv-context methodology. Date: 2026-04-05. Starting maturity: L3 (Structured). Final maturity: L5-L6 (Maintained/Adaptive).

Executive Summary

The selectools repository had a single 440-line CLAUDE.md that tried to do everything: project overview, directory tree, code conventions, testing patterns, feature checklists, release workflows, 26 common pitfalls, and a roadmap. This was restructured into a progressive disclosure hierarchy: a 67-line root for orientation, 3 subdirectory files for scoped rules, a universal AGENTS.md, 5 automation hooks, session management tooling, and CI hardening.

Result: 58/60 leverage score. Token budget dropped from ~12K (9.4%) to ~10K (8.0%) while delivering more effective context through layering.


Changes Made

1. Root CLAUDE.md: 440 lines to 67 lines (85% reduction)

What was removed: - Codebase structure tree (80 lines) — agents discover this by reading files - Key Conventions section (50 lines) — moved to src/selectools/CLAUDE.md - Testing Pattern subsection (10 lines) — moved to tests/CLAUDE.md - Feature Development Checklist (50 lines) — the /feature skill already covers this - Release History Pattern (15 lines) — discoverable from git log - Current Roadmap (30 lines) — readable from ROADMAP.md

What was kept: - Project overview (condensed to 3 lines) - Commands (7 exact commands with full flags) - Stability markers table - StepType reference (single line, all 27 types) - All 26 Common Pitfalls (condensed to one-liners) - Session workflow instruction (Document-and-Clear)

Why this works: The ETH Zurich research shows LLMs follow ~150-200 instructions reliably. At 440 lines, the agent was hitting diminishing returns on later instructions. At 67 lines, every line is a landmine or a command — content the agent cannot discover by reading code.

What could be improved: The 26 pitfalls still consume ~40 lines. These could be moved to a separate PITFALLS.md file referenced by @import, leaving the root at ~30 lines. The tradeoff is that pitfalls are the highest-value content and benefit from always being in context.

2. AGENTS.md (89 lines) — Universal Agent Config

What it contains: - Commands section with 7 copy-pasteable commands - Stack section (only non-obvious choices) - Three-tier boundaries: Always (format, test, PR workflow), Ask First (version bump, init.py changes), Never (push to main, edit migrations, commit .env) - 15 condensed landmines from the 26 pitfalls - Patterns section (stability markers, observer usage, file naming) - Subagent patterns (fan-out, worktrees, quality gates)

Why it exists: 25+ AI tools read AGENTS.md (Cursor, Copilot, Windsurf, Aider, Gemini CLI). CLAUDE.md is Claude-specific. AGENTS.md is the universal baseline that works everywhere.

What could be improved: The landmines section could include severity markers (CRITICAL/HIGH/MEDIUM) so agents prioritize the most dangerous ones. The subagent patterns section could include resource budget recommendations per agent type.

3. Subdirectory CLAUDE.md Files (3 files, 130 lines total)

tests/CLAUDE.md (46 lines): - Test commands, Agent setup requirements (_DUMMY tool) - Mock patterns (LocalProvider, RecordingProvider, (Message, UsageStats) tuples) - E2E and regression conventions - Common test gotchas

src/selectools/CLAUDE.md (51 lines): - Code style rules (Black/isort/100 chars) - Stability markers with decision tree - Provider protocol requirements (stream MUST pass tools) - Agent loop shared helpers and _RunContext - Critical pitfalls specific to source code

docs/CLAUDE.md (43 lines): - MkDocs Material setup and build command - File organization and linking rules - Feature documentation checklist (8 items) - Common doc gotchas (CHANGELOG copying, hardcoded counts)

Why this structure: Progressive disclosure. An agent working in tests/ loads root CLAUDE.md + tests/CLAUDE.md but NOT src/selectools/CLAUDE.md. This means the testing agent gets 113 lines of relevant context instead of 440 lines of everything. The scoped context is also more precise — testing rules don't pollute source code work and vice versa.

What could be improved: Additional subdirectory files could be created for src/selectools/agent/CLAUDE.md (agent loop specifics) and src/selectools/providers/CLAUDE.md (provider protocol details). These are currently in the parent src/selectools/CLAUDE.md but could be scoped further for deep work in those directories.

4. .claudeignore (45 patterns)

What it excludes: - landing/ — 2000+ lines of HTML/CSS/JS not relevant to Python development - notebooks/ — Jupyter notebooks - docs/stylesheets/, docs/javascripts/ — CSS/JS assets - landing/examples/, landing/simulations/, landing/builder/ — frontend subpages - Build artifacts: dist/, build/, *.egg-info/ - logs/, .private/, scripts/ - *.pyc, __pycache__/, htmlcov/

Why it matters: Without .claudeignore, every time an agent reads the repo, it wastes context on files irrelevant to Python development. The landing page alone is 2000+ lines. Excluding these files effectively increases the agent's available context window for actual source code.

What could be improved: Could add *.json exclusions for large data files (simulation-traces.json, examples gallery). Could also exclude docs/modules/ since those are reference docs that agents rarely need to read in full.

5. HANDOFF.md — Session Handoff Template

What it provides: - Structured template with 5 sections: What I Was Doing, Current State, What's Left, Key Decisions Made, Watch Out For - Designed to be filled by the engineer (or the /handoff skill) before ending a session - Next session reads it to resume context

Why it matters: Research shows that the Document-and-Clear workflow (update HANDOFF.md, /clear, start fresh) outperforms auto-compaction. Auto-compaction loses nuance. A human-curated handoff preserves the decisions and gotchas that matter.

What could be improved: The template could include auto-populated fields (branch, last commit, test status) that the /handoff skill fills in, with only the narrative sections left for the engineer. Could also version handoffs (HANDOFF-001.md, HANDOFF-002.md) to maintain a session history.

6. /handoff Skill

What it does: 1. Reads git status, recent commits, current branch, test status 2. Auto-fills the structured sections of HANDOFF.md 3. Suggests /clear for a fresh start

Why a skill instead of a hook: Handoff is an intentional action (the engineer decides when to hand off), not an automatic event. A hook would fire on every session end, producing noise. A skill fires on demand when the engineer is ready.

What could be improved: The skill could also save a snapshot of the conversation's key decisions to a docs/sessions/ directory for long-term project memory. Could integrate with the auto-memory system to extract and persist important context automatically.

7. Hooks (5 hooks in .claude/settings.json)

PostCompact — Re-inject CLAUDE.md rules: - Reads the first 30 lines of CLAUDE.md and injects them as additionalContext after context compaction - Solves the #1 agent failure mode: "forgetting rules after compaction" - The top 30 lines contain commands, stability markers, and the start of the pitfalls — the most critical instructions

PostToolUse Write|Edit — Auto-format: - Runs black --line-length=100 + isort on any .py file after Write or Edit - Eliminates format-related pre-commit failures (the agent no longer needs to remember to format) - This is the textbook case for "hooks for determinism" — formatting MUST happen, so make it a hook not an instruction

PreToolUse Bash — Block main push: - Detects git push ... main/master commands and returns decision: block - Allows pushes to any other branch - Enforces the PR workflow without relying on the agent remembering the rule

PreToolUse Bash — Lint before commit: - Runs flake8 src/ before any git commit command - Blocks the commit if lint errors are found - Catches issues before they hit the pre-commit hook (faster feedback loop)

SessionStart — Staleness check: - On session start, checks if CLAUDE.md or AGENTS.md are older than 14 days - Shows a warning if stale, prompting review - Prevents config drift where rules become outdated as the codebase evolves

What could be improved: Could add a PostToolUse hook that runs the relevant test file after editing source code (e.g., editing memory.py auto-runs test_memory.py). Could add a PreCompact hook that saves key context to HANDOFF.md before compaction happens. The staleness threshold (14 days) could be configurable.

8. CI Coverage Gate

What changed: - Test command in .github/workflows/ci.yml changed from pytest -n auto to pytest -n auto --cov=selectools --cov-fail-under=90 - CI now fails if test coverage drops below 90% - Coverage report shows missing lines (skip-covered for cleaner output)

Why 90%: The project currently has 95% coverage. Setting the gate at 90% provides a 5% buffer for new code while preventing significant coverage regression. This is a floor, not a target.

What could be improved: Could add mutation testing (mutmut or cosmic-ray) to detect tests that pass but don't actually verify behavior. Could add separate coverage gates for critical modules (agent/, providers/) with higher thresholds (95%).

9. Subagent Patterns in AGENTS.md

What was added: - Bug hunt patterns: /bug-hunt for parallel read-only audit, /ralph-bug-hunt module loops=N for auto-fix loops - Fan-out rule: 2+ independent tasks MUST use parallel agents working on separate files - Worktree isolation: agents making changes SHOULD use isolation: "worktree" to avoid conflicts - Quality gate: after parallel agents complete, MUST run full test suite before committing

Why in AGENTS.md: Subagent orchestration patterns are tool-agnostic principles. An engineer using Cursor or Copilot with agent capabilities benefits from the same patterns.

What could be improved: Could include resource budget recommendations (e.g., "limit subagents to 50K tokens each for research tasks"). Could document the merge strategy for when parallel agents conflict on the same file.


Architecture Decision: Why Progressive Disclosure

Root CLAUDE.md (67 lines)     <- Always loaded. Orientation + commands + pitfalls.
  |
  +-- AGENTS.md (96 lines)    <- Always loaded. Universal boundaries + landmines.
  |
  +-- tests/CLAUDE.md          <- Only when working in tests/
  +-- src/selectools/CLAUDE.md <- Only when working in src/
  +-- docs/CLAUDE.md           <- Only when working in docs/
  |
  +-- .claude/skills/ (10)     <- Only when invoked by name (/feature, /test, etc.)
  |
  +-- .claude/settings.json    <- 5 hooks, always active, deterministic

An agent editing src/selectools/agent/core.py loads: root (67) + AGENTS.md (96) + src/selectools/CLAUDE.md (51) = 214 lines of highly relevant context. It does NOT load tests/CLAUDE.md or docs/CLAUDE.md.

Previously, the same agent would load 440 lines, 60% of which was irrelevant to the task (testing patterns, doc conventions, release workflows, directory trees).


Token Budget

Component               Before      After       Change
---------------------------------------------------------
Root CLAUDE.md          ~12,000     ~1,300      -89%
AGENTS.md               (none)     ~1,780      new
Subdirectory files       (none)     ~2,800      new (scoped, not always loaded)
System prompt            ~2,500     ~2,500      same
Skill descriptions       ~1,800     ~2,000      +1 skill (/handoff)
---------------------------------------------------------
Always-loaded total     ~16,300    ~7,580       -53%
Max (all files loaded)  ~16,300   ~10,380       -36%

The "always-loaded" budget dropped 53% while delivering more effective context. The key insight: loading fewer, more relevant lines beats loading everything.


Maturity Progression

Phase Level Description
Before L3 (Structured) Single CLAUDE.md, no hooks, no session management
After Phase 3 L4 (Abstracted) Multi-level hierarchy, AGENTS.md, subdirectory scoping
After Phase 5 L5 (Maintained) Hooks for determinism, staleness detection
After Phase 6-8 L6 (Adaptive) Skills, subagent patterns, session workflow, CI gates

Final Leverage Scores

Layer                          Score   Notes
---------------------------------------------------------------------
Verification                   10/10   CI + coverage gate + 14 pre-commit hooks
CLAUDE.md / AGENTS.md          10/10   Progressive disclosure, RFC 2119, no negatives
Hooks                          10/10   5 hooks: format, push block, lint, PostCompact, staleness
Skills                         10/10   10 skills with argument hints and clear triggers
Subagent patterns               9/10   Documented patterns, missing resource budgets
Session management              9/10   HANDOFF.md + .claudeignore + Document-and-Clear + PostCompact
---------------------------------------------------------------------
OVERALL                        58/60
---------------------------------------------------------------------

Recommendations for Future Improvement

  1. Skill evaluations: Create test cases for each skill to verify they produce correct output. Currently skills are tested by usage, not by automated evaluation.

  2. Subagent resource budgets: Add token counting to subagent dispatches so you can set per-agent limits and track total cost of parallel operations.

  3. Session metrics: Track how often Document-and-Clear is used vs auto-compaction, and whether sessions that use handoffs have fewer mistakes. This would provide data to optimize the session workflow.

  4. Config drift detection: The staleness hook checks file age, but not whether the config is actually stale relative to code changes. A smarter check would compare config file modification dates against source file modification dates.

  5. Cursor rule refresh: The 5 existing .cursor/rules/*.mdc files predate this overhaul and may contain content that's now redundant with AGENTS.md. Review and trim them.