cocoindex-io/cocoindex-code
Summary
cocoindex-code is a CLI tool and MCP server that builds a local semantic search index over your codebase using AST-based chunking and local embeddings (sentence-transformers/all-MiniLM-L6-v2). It runs a background daemon process that manages indexing and serves search queries, integrating with Claude Code, Codex, and other MCP-compatible coding agents to reduce token usage by feeding agents only relevant code chunks instead of entire files.
Great for
People building local-first semantic code search tooling for AI coding agents, especially those interested in the intersection of AST parsing, vector embeddings, daemon architecture, and MCP integration.
Easy wins
- Add language support: the README mentions AST-based chunking, but the actual language list isn't visible in the sampled files. Adding support for a new language (Go, Rust, etc.) via indexer.py would be a concrete, bounded contribution.
- Add 'help-wanted' or 'good-first-issue' labels to the 15 open issues. None are currently labeled, which makes it hard for newcomers to find entry points.
- Improve error messages when the CocoIndex daemon fails to start: client.py currently raises a generic RuntimeError('Failed to connect to daemon after starting it') with no diagnostic information about the cause.
- Add a --json output flag to `ccc search` and `ccc status` for programmatic consumption. The data structures are already clean Pydantic models, so only a serialization path is needed.
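For the daemon error-message item above, one possible shape of a more diagnostic failure is sketched here. This is a minimal illustration, not code from the repository: `DaemonStartError`, `tail`, `describe_start_failure`, and the idea of a daemon log file at a known path are all assumptions. Only the first line of the message mirrors the existing RuntimeError text.

```python
from __future__ import annotations

from pathlib import Path


class DaemonStartError(RuntimeError):
    """Hypothetical error type carrying diagnostics about a failed daemon start."""


def tail(path: Path, max_lines: int = 5) -> list[str]:
    """Return the last few lines of a text file, or [] if it cannot be read."""
    try:
        return path.read_text(errors="replace").splitlines()[-max_lines:]
    except OSError:
        return []


def describe_start_failure(socket_path: Path, log_path: Path, waited_s: float) -> str:
    """Build a multi-line message: what was tried, and what was observed."""
    socket_state = "exists" if socket_path.exists() else "missing"
    lines = [
        f"Failed to connect to daemon after starting it (waited {waited_s:.1f}s).",
        f"  socket: {socket_path} ({socket_state})",
        f"  log:    {log_path}",
    ]
    log_tail = tail(log_path)
    if log_tail:
        lines.append("  last log lines:")
        lines.extend(f"    {entry}" for entry in log_tail)
    else:
        lines.append("  no daemon log output was found")
    return "\n".join(lines)
```

A caller could then `raise DaemonStartError(describe_start_failure(sock, log, 5.0))` in place of the generic error. On Windows the transport may not be a filesystem path at all, so the socket check would need a platform-specific counterpart.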
Red flags
- Depends on cocoindex==1.0.0a32 (a pre-release alpha). The core indexing/embedding engine is an opaque Rust binary that contributors can't easily inspect or modify, which makes bugs in the indexing pipeline hard to reproduce.
- The metadata reports commit_count: 1 and contributor_count: 1 despite 8 listed contributors. This suggests an effectively solo project with very recent external contributions, which raises bus-factor risk.
- The 'Development Status :: 3 - Alpha' classifier in pyproject.toml conflicts with the confident '1 min setup — just works' README marketing; behavior in edge cases (large monorepos, network filesystems, Windows) may be untested.
- No rate limiting or size limits on search queries to the daemon: in the visible code path, a malformed or extremely long query string goes straight to the embedder without validation.
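The missing query validation noted in the last item could be addressed with a small guard in front of the embedder. The following is a sketch under stated assumptions: `validate_query`, `InvalidQueryError`, and the `MAX_QUERY_CHARS` value are hypothetical and do not come from the repository. Rate limiting is a separate concern and is not covered here.

```python
MAX_QUERY_CHARS = 2_000  # hypothetical limit; tune to the embedder's input size


class InvalidQueryError(ValueError):
    """Raised when a search query fails basic validation."""


def validate_query(raw: object, max_chars: int = MAX_QUERY_CHARS) -> str:
    """Return a cleaned query string, or raise InvalidQueryError."""
    if not isinstance(raw, str):
        raise InvalidQueryError(f"query must be a string, got {type(raw).__name__}")
    # Drop control characters (keeping tabs, newlines, spaces), then trim the ends.
    cleaned = "".join(ch for ch in raw if ch.isprintable() or ch in "\t\n ").strip()
    if not cleaned:
        raise InvalidQueryError("query is empty")
    if len(cleaned) > max_chars:
        raise InvalidQueryError(f"query is {len(cleaned)} chars; limit is {max_chars}")
    return cleaned
```

Rejecting oversized input outright, rather than silently truncating it, keeps the behavior visible to the calling agent, which can then shorten the query itself.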
Code quality
The code is notably well-structured for a project this size. daemon.py cleanly separates ProjectRegistry (state), connection handling, and request dispatch, uses an asyncio Lock to serialize indexing, and tracks load-time indexing explicitly with an asyncio.Event. client.py handles the Windows/Unix socket differences carefully (the _pid_alive Windows workaround for a CPython bug is a real edge case handled correctly). The e2e test suite in test_e2e.py is comprehensive, covering incremental indexing, .gitignore handling, path filters, daemon restart, and subdirectory scoping. One weak spot: cli.py has `async def _bg_index(client, project_root: str) -> None: # type: ignore[no-untyped-def]`, which sidesteps the project's own strict mypy config, and the `# type: ignore[return-value]` suppressions in client.py are a code smell around the protocol dispatch.
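The Lock-plus-Event pattern described above can be illustrated generically. This is not daemon.py's actual code: `ProjectState`, `run_index`, `search`, and `demo` are hypothetical names, and the sleep stands in for real indexing work.

```python
import asyncio


class ProjectState:
    """Hypothetical per-project state: one lock and one 'initial index done' event."""

    def __init__(self) -> None:
        self.index_lock = asyncio.Lock()  # serializes indexing runs
        self.initial_index_done = asyncio.Event()  # set once the first run finishes
        self.runs = 0

    async def run_index(self) -> None:
        # Only one indexing run may hold the lock at a time.
        async with self.index_lock:
            await asyncio.sleep(0.01)  # stand-in for real indexing work
            self.runs += 1
            self.initial_index_done.set()

    async def search(self, query: str) -> str:
        # Searches wait for the load-time index but never take the lock.
        await self.initial_index_done.wait()
        return f"results for {query!r} after {self.runs} index run(s)"


async def demo() -> str:
    state = ProjectState()
    # The search is issued before any index exists, so it blocks on the event.
    search_task = asyncio.create_task(state.search("daemon"))
    # Two concurrent index requests are executed one after the other.
    await asyncio.gather(state.run_index(), state.run_index())
    return await search_task


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Keeping the event separate from the lock means a search issued during a later re-index is not blocked; it only has to wait if no index has ever been built.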
What makes it unique
The combination of local-first operation (no API key), AST-aware chunking, a daemon architecture for cross-session index reuse, and first-class MCP integration is genuinely differentiated from simple grep-based tools or cloud-dependent solutions like GitHub Copilot's code search. The closest competitors are likely tree-sitter-based indexers or local Chroma/Qdrant setups, but those require more configuration. The 'skills' agent integration pattern (npx skills add) is novel. However, the core value depends heavily on the CocoIndex Rust engine, which this project consumes as a pinned pre-release dependency, and that limits how deeply contributors here can improve the fundamental indexing quality.
Scores
Barrier to entry
Medium. The architecture is non-trivial (daemon process + IPC + async request dispatch + streaming responses), the dependency on a pre-release Rust-backed CocoIndex engine (cocoindex==1.0.0a32) adds opacity, and there are zero good-first-issue labels despite 15 open issues. On the other hand, the code is well-structured, there is a CONTRIBUTING.md and a CLAUDE.md with build commands, and the test suite is comprehensive, with real e2e tests.