Nubaeon/empirica
Summary
Empirica is a Python CLI tool and MCP server that wraps AI coding sessions (primarily Claude Code) with an 'epistemic measurement' layer — tracking confidence vectors, forcing investigation before code edits via a Sentinel gate, and persisting findings/unknowns/dead-ends across sessions in SQLite. It injects structured self-assessment prompts into AI workflows and surfaces a real-time statusline showing confidence scores. Think of it as a structured metacognition harness for AI agents.
Great for
People interested in AI reliability tooling: specifically, instrumenting LLM coding agents with structured confidence tracking, persistent cross-session memory, and gating mechanisms that prevent premature code edits
Easy wins
- +Add a GitHub Actions CI workflow — the Makefile already has `make ci` defined, pyproject.toml has full pytest/ruff/pyright config, but there's no .github/workflows/*.yml file at all
- +Write tests for the MCP server tool definitions in empirica-mcp/empirica_mcp/server.py — the 100+ tool definitions appear completely untested; the file tree shows a tests/ directory but no MCP-specific test files
- +Fix the duplicate/inconsistent module structure: there are both empirica/integration/ and empirica/integrations/ directories, suggesting a refactor was started but not completed
- +Narrow and log the bare except clauses throughout workflow_commands.py — there are dozens of `except Exception: pass` blocks that swallow errors with no logging or error context
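The except-clause fix is mechanical. A minimal sketch of the pattern — the function and the plugin-loading scenario are illustrative stand-ins, not Empirica's actual code:

```python
import logging

logger = logging.getLogger(__name__)

def load_optional_plugin(name: str):
    """Illustrative stand-in for a swallow-everything block in
    workflow_commands.py: narrow the exception type, log the context,
    and keep the graceful fallback."""
    try:
        return __import__(name)
    except ImportError as exc:  # was: except Exception: pass
        logger.warning("optional plugin %r unavailable: %s", name, exc)
        return None
```

The key change is observability: the failure path still degrades gracefully, but it now leaves a trace and no longer masks unrelated bugs (a `TypeError` inside the try block would propagate instead of vanishing).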
Red flags
- !Single-commit history despite the repo claiming v1.6.7 and extensive prior development — this is almost certainly a squashed or re-initialized repo, making it impossible to reconstruct the actual development history or evaluate commit hygiene
- !docker-compose.yml hardcodes a personal home directory path: `/home/yogapad/.empirica:/root/.empirica:ro` — this is a developer's machine-specific path committed to the repo
- !pyproject.toml references CVE fixes with future dates: 'CVE-2026-27205', 'CVE-2026-27199', 'CVE-2026-24049' — CVEs with 2026 dates in a 2025/2026 repo are suspicious and suggest fabricated security justifications for dependency versions
- !No CI/CD pipeline despite having full test infrastructure configured — the tests may not actually pass
- !Metadata reports a contributor_count of 6, but the actual analysis shows 1 — the GitHub API data may be inflated, or forks are being counted
- !The README claims 'emerged from 600+ real working sessions' but there is 1 commit — no way to verify this claim
- !empirica/config/mcp_security.yaml exists, but the docker-compose.yml setup undercuts it: credential paths are hardcoded and the entire project directory is mounted read-only into containers — a questionable security boundary
Code quality
The architecture is genuinely thoughtful: repository pattern in session_database.py, dialect-aware schema adaptation, lazy loading to avoid circular imports, and a proper migration runner. However, workflow_commands.py has ~15 silent `except Exception: pass` blocks that swallow errors with no observability; the `_auto_bootstrap` function spawns a subprocess of `empirica` itself, a self-referential spawn that can fail silently in many environments; and deprecated methods (log_preflight_assessment, log_check_phase_assessment) retain full implementations rather than just raising DeprecationWarning. docs_commands.py shows good separation but likewise wraps its AST parsing in bare `except Exception: pass` with no logging.
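The deprecated-method issue has a standard remedy: keep the old signature, warn, and delegate, instead of maintaining two live implementations. A hedged sketch — the method names match those cited above, but the class, the bodies, and the replacement name `log_assessment` are assumptions for illustration:

```python
import warnings

class SessionLogger:
    def log_assessment(self, data: dict) -> None:
        """Current API (name assumed for illustration)."""

    def log_preflight_assessment(self, data: dict) -> None:
        """Deprecated shim: emit a DeprecationWarning and delegate,
        so callers migrate and only one code path stays maintained."""
        warnings.warn(
            "log_preflight_assessment is deprecated; use log_assessment",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this shim
        )
        self.log_assessment(data)
```

With `stacklevel=2` the warning is attributed to the caller's line, which is what makes these shims actionable during migration.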
What makes it unique
The core concept — adding structured epistemic state tracking (confidence vectors, Sentinel gates, noetic/praxic phase separation) as a middleware layer over AI coding agents — is genuinely novel and not a clone of anything obvious. Most AI reliability tooling focuses on output validation rather than mid-session confidence gating. However, the practical value depends entirely on whether LLMs actually self-assess accurately when prompted by this system, which is an open research question the README doesn't address. The 13-vector system is interesting but appears to be empirically derived by one developer rather than grounded in published research.
Scores
Barrier to entry
high
The commit history shows exactly 1 commit (despite claiming v1.6.7 with 600+ sessions of development), no CI pipeline, 0 good-first-issues, and the codebase requires understanding a multi-layered proprietary framework (CASCADE, Sentinel, noetic/praxic split, 13 epistemic vectors) before any contribution makes sense — there's no onboarding path for contributors beyond the end-user docs.