sherlock-project/sherlock

Python · 78,609 stars · 9,184 forks · 230 issues · 310 contributors · MIT

Summary

Sherlock is a CLI OSINT tool that searches for a given username across 400+ social media sites using concurrent HTTP requests. It checks each site with one of three detection methods (HTTP status code, an error message in the response body, or the redirect URL), then reports which platforms have an account registered under that username. Output can be saved as CSV, Excel, or text files.
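The three detection methods can be pictured roughly as follows. This is a minimal sketch, not Sherlock's actual code: the `Detection` enum, the `Response` stand-in, and `account_exists` are hypothetical names, and the exact rules (for example, treating any 2xx status as "claimed") are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Detection(Enum):
    """Hypothetical labels for the three detection methods."""
    STATUS_CODE = "status_code"
    MESSAGE = "message"
    RESPONSE_URL = "response_url"


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (not a real library type)."""
    status_code: int
    text: str
    url: str


def account_exists(method: Detection, response: Response,
                   error_msg: str = "", error_url: str = "") -> bool:
    """Guess whether a profile exists, using one detection method."""
    if method is Detection.STATUS_CODE:
        # Assumption: any 2xx status means the profile page exists.
        return 200 <= response.status_code < 300
    if method is Detection.MESSAGE:
        if not error_msg:
            raise ValueError("error_msg is required for message detection")
        # The site returns 200 either way, so look for its "not found" text.
        return error_msg not in response.text
    if method is Detection.RESPONSE_URL:
        # Assumption: missing profiles redirect to a known error URL.
        return response.url != error_url
    raise ValueError(f"unknown detection method: {method!r}")
```

Each branch depends on a site-specific detail (a status code, a string, or a URL), which is why a site redesign can silently turn a correct entry into a false positive or false negative.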

Great for

People interested in OSINT, web scraping reliability challenges, and maintaining large data-driven site manifests — specifically the cat-and-mouse game of keeping detection heuristics accurate against WAFs, site redesigns, and anti-bot measures

Easy wins

+ Fix false positives/negatives flagged in open issues: the issue tracker has dedicated 'false positive' and 'false negative' labels and templates, and fixes are just JSON edits to data.json with a regex or errorMsg tweak
+ Add WAF fingerprint strings to the WAFHitMsgs list in sherlock.py: the existing list (lines ~280-285) has only 4 entries, all from 2024, and new WAF signatures are regularly needed
+ Write or improve tests in tests/few_test_basic.py, which appears sparse judging by its name; the test infrastructure (pytest, conftest, xdist) is already set up
+ Add new site entries: there's a dedicated 'site-request' issue template and a GitHub Action (validate_modified_targets.yml) that auto-validates manifest changes, so the contribution pipeline is clear
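For the WAF item, the idea is a substring check of the response body against a list of known fingerprints. The sketch below assumes that shape. The strings are made-up placeholders rather than the real entries in sherlock.py, and `is_waf_block` is a hypothetical helper name.

```python
# Placeholder fingerprints for illustration only; the real WAFHitMsgs
# entries in sherlock.py are different and much longer.
WAF_HIT_MSGS = [
    "example-waf-challenge",
    "example-bot-protection-interstitial",
]


def is_waf_block(body: str, fingerprints=WAF_HIT_MSGS) -> bool:
    """Return True if the response body contains any known WAF fingerprint."""
    return any(marker in body for marker in fingerprints)
```

Under this assumption, adding a signature is a one-line change to the list. The harder part is picking a string that is stable across the WAF's page variants and unlikely to appear on a genuine profile page.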

Red flags

! SiteInformation.__init__ in sites.py accepts username_unclaimed as a parameter but unconditionally overwrites it on the next line, so the parameter is dead code and potentially misleading
! globvar in notify.py is a module-level global counter that isn't reset between runs: if sherlock() is called multiple times in the same process (e.g., as a library), the count accumulates incorrectly
! The Dockerfile has placeholder comments 'CHANGE ME ON UPDATE' for VCS_REF and VERSION_TAG with empty defaults, which makes it easy to ship a misconfigured image
! No CONTRIBUTING.md exists despite 310 contributors, so onboarding is ad-hoc and relies on issue templates alone
! Default behavior fetches data.json from GitHub master at runtime rather than using the bundled local copy, meaning every run makes a live network call to GitHub before doing anything. This adds silent latency and becomes a failure mode if GitHub is unreachable
! pandas and openpyxl are heavyweight dependencies for what amounts to writing a CSV/Excel file, adding significant install weight for an optional output format
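One way to soften the data.json red flag is to read the bundled copy by default and treat the remote fetch as opt-in with a fallback. The sketch below shows that idea only. `load_manifest`, `fetch_with_urllib`, and their parameters are hypothetical names, not part of Sherlock.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def fetch_with_urllib(url: str, timeout: float = 5.0) -> bytes:
    """Download raw bytes from a URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def load_manifest(local_path: str,
                  remote_url: Optional[str] = None,
                  fetch: Callable[[str], bytes] = fetch_with_urllib) -> dict:
    """Load the site manifest, using the network only when asked to.

    With remote_url=None the bundled file is read and no request is made.
    If a remote URL is given but unreachable or malformed, fall back to
    the local copy instead of failing the whole run.
    """
    if remote_url is not None:
        try:
            return json.loads(fetch(remote_url))
        except (OSError, ValueError):
            # URLError and timeouts are OSError subclasses;
            # JSONDecodeError is a ValueError subclass.
            pass
    return json.loads(Path(local_path).read_text(encoding="utf-8"))
```

Passing `fetch` as a parameter keeps the function testable offline. The trade-off is that a local-first default can serve a stale manifest, which matters for a project whose site list changes constantly.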

Code quality

decent

The core sherlock() function in sherlock.py is a single ~200-line function that does request dispatch and result parsing in one pass — it works but would benefit from decomposition. Error handling is thorough for network exceptions. There's a notable design smell in sites.py: SiteInformation.__init__ accepts a username_unclaimed parameter but immediately overwrites it with secrets.token_urlsafe(32), making the parameter pointless dead code. The global variable 'globvar' in notify.py for counting results is a genuine code smell that would break under concurrent multi-username searches. The WAFHitMsgs list is a raw list of long strings with inline comments — functional but fragile and unmaintainable at scale.
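Both smells have small, conventional fixes. The sketch below is illustrative only: the class is a stripped-down stand-in for the real SiteInformation (most fields omitted), and `ResultCounter` is a hypothetical replacement for the module-level counter, not existing code.

```python
import secrets
from typing import Optional


class SiteInformation:
    """Simplified stand-in for the class in sites.py."""

    def __init__(self, name: str, username_unclaimed: Optional[str] = None):
        self.name = name
        # Honor the caller's value; generate a random one only when none
        # is given. (Dropping the parameter entirely would also remove
        # the dead code.)
        self.username_unclaimed = username_unclaimed or secrets.token_urlsafe(32)


class ResultCounter:
    """Per-run counter, so repeated or parallel runs do not share state."""

    def __init__(self) -> None:
        self.count = 0

    def record(self) -> int:
        """Count one result and return the running total."""
        self.count += 1
        return self.count
```

With a counter instance created per search and handed to the notifier, calling the search function twice in one process starts each run from zero, which a module-level global cannot guarantee.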

What makes it unique

Sherlock is genuinely the canonical tool in this space — it predates and has more site coverage than most competitors (Maigret, etc.), and its 78k stars reflect real adoption. The challenge isn't uniqueness but maintenance: the site manifest rots constantly as sites change their error pages, add WAFs, or shut down. This is more of an ongoing curation project than a software engineering one at this point.
