cocoindex-io/cocoindex-code

Python | 81159 | 15 issues | 8 contributors | Apache-2.0

Summary

cocoindex-code is a CLI tool and MCP server that builds a local semantic search index over your codebase using AST-based chunking and local embeddings (sentence-transformers/all-MiniLM-L6-v2). It runs a background daemon process that manages indexing and serves search queries, integrating with Claude Code, Codex, and other MCP-compatible coding agents to reduce token usage by feeding agents only relevant code chunks instead of entire files.
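To make "AST-based chunking" concrete, here is a minimal sketch of the general idea. It uses Python's standard `ast` module as a stand-in and only handles Python source; the project's real chunker is not shown in this review and may work differently. The function name `chunk_python_source` and the chunk dictionary layout are assumptions made for illustration.

```python
from __future__ import annotations

import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class.

    Illustrative only: uses the stdlib `ast` module, not the project's chunker.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            start, end = node.lineno, node.end_lineno
            chunks.append(
                {
                    "name": node.name,
                    "kind": type(node).__name__,
                    "start_line": start,
                    "end_line": end,
                    "text": "\n".join(lines[start - 1:end]),
                }
            )
    return chunks


SAMPLE = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    "    return f'hi {name}'\n"
    "\n"
    "class Repo:\n"
    "    def path(self):\n"
    "        return os.getcwd()\n"
)

if __name__ == "__main__":
    for chunk in chunk_python_source(SAMPLE):
        print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk is a syntactically complete unit, which is the property that lets an agent receive a whole function or class rather than an arbitrary window of lines. Embedding and storage would follow as separate steps and are not shown here.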

Great for

Developers building local-first semantic code search tooling for AI coding agents, especially those interested in the intersection of AST parsing, vector embeddings, daemon architecture, and MCP integration

Easy wins

+Add language support: the README mentions AST-based chunking, but the actual language list isn't visible in the samples. Adding support for a new language (Go, Rust, etc.) via indexer.py would be a concrete, bounded contribution
+Add 'help-wanted' or 'good-first-issue' labels to the 15 open issues — currently none are labeled, making it hard for newcomers to find entry points
+Improve error messages when the CocoIndex daemon fails to start — currently client.py raises generic RuntimeError('Failed to connect to daemon after starting it') with no diagnostic info about why
+Add a --json output flag to `ccc search` and `ccc status` for programmatic consumption — the data structures are already clean Pydantic models, just needs a serialization path
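For the daemon start-up item above, one possible shape for a more diagnostic error is sketched below. Everything here is hypothetical: `DaemonStartError`, `connect_with_diagnostics`, `tail_lines`, the injected `connect` callable, and the idea of a daemon log file are assumptions for illustration, not names taken from client.py.

```python
from __future__ import annotations

import time
from pathlib import Path
from typing import Callable


class DaemonStartError(RuntimeError):
    """Hypothetical replacement for the generic RuntimeError, carrying context."""

    def __init__(self, socket_path: str, attempts: int,
                 last_error: BaseException | None, log_tail: list[str]) -> None:
        self.socket_path = socket_path
        self.attempts = attempts
        self.last_error = last_error
        self.log_tail = log_tail
        lines = [
            f"Failed to connect to daemon after starting it ({attempts} attempts).",
            f"  socket: {socket_path}",
            f"  last error: {last_error!r}",
        ]
        if log_tail:
            lines.append("  last daemon log lines:")
            lines.extend(f"    {line}" for line in log_tail)
        super().__init__("\n".join(lines))


def tail_lines(path: Path, n: int = 5) -> list[str]:
    """Return the last n lines of a log file, or [] if it cannot be read."""
    try:
        return path.read_text(encoding="utf-8", errors="replace").splitlines()[-n:]
    except OSError:
        return []


def connect_with_diagnostics(connect: Callable[[str], object], socket_path: str,
                             log_path: Path, attempts: int = 3, delay: float = 0.0):
    """Retry `connect`, then raise an error that says what was tried and why it failed."""
    last_error: BaseException | None = None
    for _ in range(attempts):
        try:
            return connect(socket_path)
        except OSError as exc:  # e.g. ConnectionRefusedError, FileNotFoundError
            last_error = exc
            time.sleep(delay)
    raise DaemonStartError(socket_path, attempts, last_error,
                           tail_lines(log_path)) from last_error
```

The point of the sketch is that the socket path, the underlying OS error, and the tail of any daemon log are usually enough for a user to tell a stale socket from a crashed process.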

Red flags

!Depends on cocoindex==1.0.0a32 (pre-release alpha) — the core indexing/embedding engine is an opaque Rust binary that contributors can't easily inspect or modify, making bug reproduction in the indexing pipeline difficult
!The metadata reports commit_count: 1 and contributor_count: 1 despite 8 listed contributors, which suggests this is effectively a solo project with only very recent external contributions and therefore carries bus-factor risk
!The 'Development Status :: 3 - Alpha' classifier in pyproject.toml conflicts with the confident '1 min setup — just works' README marketing; behavior in edge cases (large monorepos, network filesystems, Windows) may be untested
!No rate limiting or size limits on search queries to the daemon — a malformed or extremely long query string goes straight to the embedder without validation in the visible code path
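As a sketch of the kind of guard the last red flag is asking for, the function below validates a query before it would reach an embedder. The names `validate_query`, `InvalidQueryError`, and the 2000-character cap are assumptions chosen for illustration; the project defines no such limit in the code reviewed here, and rate limiting is not covered by this sketch.

```python
from __future__ import annotations

MAX_QUERY_CHARS = 2000  # assumed cap, not a value from the project


class InvalidQueryError(ValueError):
    """Raised when a search query should be rejected before embedding."""


def validate_query(raw: object, max_chars: int = MAX_QUERY_CHARS) -> str:
    """Normalize a query string and reject empty or oversized input."""
    if not isinstance(raw, str):
        raise InvalidQueryError(f"query must be a string, got {type(raw).__name__}")
    # Collapse all runs of whitespace (including newlines and tabs) to single spaces.
    query = " ".join(raw.split())
    if not query:
        raise InvalidQueryError("query is empty")
    if len(query) > max_chars:
        raise InvalidQueryError(
            f"query is {len(query)} characters; the limit is {max_chars}"
        )
    return query
```

Running a check like this in the daemon's request handler, rather than only in the CLI, would also cover requests that arrive through MCP clients.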

Code quality

good

The code is notably well-structured for a project this size. daemon.py has clean separation between ProjectRegistry (state), connection handling, and request dispatch, with proper asyncio Lock usage for indexing concurrency and explicit load-time indexing tracking via asyncio.Event. client.py handles the Windows/Unix socket differences carefully (the _pid_alive Windows workaround for a CPython bug is a real edge case handled correctly). The e2e test suite in test_e2e.py is genuinely comprehensive, covering incremental indexing, gitignore handling, path filters, daemon restart, and subdirectory scoping. One weak spot: cli.py has `async def _bg_index(client, project_root: str) -> None: # type: ignore[no-untyped-def]`, which sidesteps the project's own strict mypy config instead of annotating the parameter, and the `# type: ignore[return-value]` suppressions in client.py are a code smell around the protocol dispatch.
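The Lock-plus-Event pattern praised above can be sketched roughly as follows. This is an illustration of the general technique, not a copy of daemon.py: `ProjectState`, `run_index`, `search`, and `demo` are hypothetical names, and the "indexing" is a short sleep.

```python
import asyncio


class ProjectState:
    """Hypothetical per-project state: one lock, one 'first index done' flag."""

    def __init__(self) -> None:
        self.index_lock = asyncio.Lock()            # serializes indexing runs
        self.initial_index_done = asyncio.Event()   # set after the first run
        self.runs = 0

    async def run_index(self, do_index) -> None:
        async with self.index_lock:  # a second caller waits here
            await do_index()
            self.runs += 1
            self.initial_index_done.set()

    async def search(self, query: str) -> str:
        # Block queries until at least one indexing pass has completed.
        await self.initial_index_done.wait()
        return f"results for {query!r}"


async def demo() -> dict:
    state = ProjectState()
    active = 0
    max_active = 0

    async def fake_index() -> None:
        nonlocal active, max_active
        active += 1
        max_active = max(max_active, active)
        await asyncio.sleep(0.01)  # stand-in for real indexing work
        active -= 1

    search_task = asyncio.create_task(state.search("parse config"))
    await asyncio.sleep(0)  # let the search start; it should wait on the Event
    waiting_before_index = not search_task.done()

    # Two concurrent index requests: the Lock forces them to run one at a time.
    await asyncio.gather(state.run_index(fake_index), state.run_index(fake_index))
    result = await search_task
    return {
        "runs": state.runs,
        "max_active": max_active,
        "waiting_before_index": waiting_before_index,
        "result": result,
    }


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The Lock answers "may I index now?" while the Event answers "has anything been indexed yet?", which is why the two primitives are kept separate.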

What makes it unique

The combination of local-first (no API key), AST-aware chunking, daemon architecture for cross-session index reuse, and first-class MCP integration is genuinely differentiated from simple grep-based tools or cloud-dependent solutions like GitHub Copilot's code search. The closest competitor is likely tree-sitter-based indexers or local Chroma/Qdrant setups, but those require more configuration. The 'skills' agent integration pattern (npx skills add) is novel. However, the core value is heavily dependent on the CocoIndex Rust engine, which is developed in a separate repository and consumed here as a pinned pre-release dependency; that limits how deeply contributors to this repository can improve the fundamental indexing quality.

Scores

Barrier to entry

medium

The architecture is non-trivial (daemon process + IPC + async request dispatch + streaming responses), the dependency on a pre-release Rust-backed CocoIndex engine (cocoindex==1.0.0a32) adds opacity, and there are zero good-first-issue labels despite 15 open issues — but the code is well-structured, there's a CONTRIBUTING.md, CLAUDE.md with build commands, and the test suite is comprehensive with real e2e tests.
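For readers new to the daemon-plus-IPC shape described above, the sketch below shows a request/response loop over `multiprocessing.connection`. It is a toy under stated assumptions: the "daemon" is a thread rather than a separate process, it listens on a localhost TCP port instead of a Unix socket or named pipe, and the message types (`search`, `status`, `shutdown`) and the `handle`/`serve`/`demo` functions are invented for illustration.

```python
from __future__ import annotations

import threading
from multiprocessing.connection import Client, Listener

AUTHKEY = b"demo-only"  # placeholder key for this sketch


def handle(request: dict) -> dict:
    """Dispatch one request to a (fake) handler based on its 'type' field."""
    kind = request.get("type")
    if kind == "search":
        return {"ok": True, "results": [f"chunk matching {request['query']!r}"]}
    if kind == "status":
        return {"ok": True, "indexing": False}
    return {"ok": False, "error": f"unknown request type: {kind!r}"}


def serve(listener: Listener) -> None:
    """Accept a single client and answer requests until shutdown or EOF."""
    with listener.accept() as conn:
        while True:
            try:
                request = conn.recv()
            except EOFError:  # client closed the connection
                break
            if request.get("type") == "shutdown":
                conn.send({"ok": True})
                break
            conn.send(handle(request))


def demo() -> list[dict]:
    # Port 0 asks the OS for a free port; listener.address reports the real one.
    with Listener(("127.0.0.1", 0), authkey=AUTHKEY) as listener:
        worker = threading.Thread(target=serve, args=(listener,), daemon=True)
        worker.start()
        replies = []
        with Client(listener.address, authkey=AUTHKEY) as conn:
            for request in (
                {"type": "search", "query": "open socket"},
                {"type": "status"},
                {"type": "bogus"},
                {"type": "shutdown"},
            ):
                conn.send(request)
                replies.append(conn.recv())
        worker.join(timeout=5)
    return replies


if __name__ == "__main__":
    for reply in demo():
        print(reply)
```

A real daemon adds the parts that make the architecture non-trivial: process lifecycle, concurrent clients, async dispatch, and streaming responses. None of those are attempted here.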

Skills needed

Python (async/await, asyncio, dataclasses)
Understanding of daemon/IPC patterns (multiprocessing.connection, Unix sockets, named pipes)
Vector search fundamentals (embeddings, similarity search, sqlite-vec)
MCP protocol (Model Context Protocol)
Tree-sitter or AST-based code parsing
Cross-platform process management (Windows named pipes vs Unix sockets)