A source-verified comparison of kkrlstrm/gtm-research against LeadGrow's current research & enrichment execution — on cost and on outcome — and a concrete plan for whether/how to integrate it to enhance the Orchestrator. Verified against cloned source on 2026-06-12.
LeadGrow's research/enrichment is paid-API-first and stateless across runs at the web-content level: every campaign re-fetches and re-pays for facts about the same companies, and enrichment outputs aren't structurally verified against a source. We want to know (1) does gtm-research's free-first + cached + verified approach materially cut cost and raise accuracy, and (2) can it plug into the Orchestrator without destabilizing it.
Scope note: our standalone scraper repos (spider-llm-scraper, crunchbase-scraper, etc.) were not deep-read for this PRD — the comparison target is the Orchestrator's in-pipeline research/enrichment, which is where integration would land.
The mechanism, confirmed against source — this is what we'd actually be adopting.
research schema; keys are sha256(normalize_url) / sha256(rung|normalize_query) (strips utm/gclid). scope: shared ("web facts are client-agnostic, one cache across all runs"). Dead URLs (401/404/410) negative-cached 90d. Source: research_db.py:110-162
source_url raw (--no-digest) and confirms or blanks every field. Three states: verified=true+source / ""+NOT FOUND / guess+UNVERIFIED. decide_what_to_fetch and verify_claim are flagged non-delegable to cheap models. Source: entity-research.js:127-188

| Lever | LeadGrow today | gtm-research | Delta |
|---|---|---|---|
| First-touch search | Paid provider / Clay credits | Free tiers first (ddg/brave/serper ~free quota) | Large on high-volume discovery |
| Repeat facts across runs/clients | Re-fetched, re-paid every campaign | Shared cache, $0 on hit (client-agnostic) | Largest single lever |
| Page fetch | Provider/scraper per call | native → jina (free) before any paid extract | Material |
| Long pages | Full content into context | digest-compress on a cheap model | Token savings |
| Cost visibility | Python rate-cards; TS path reports $0 (bug) | Per-rung credits + USD + cost-split views | They're more honest at runtime |
The repo's own "88% free / 11% tavily / 1% parallel" split is documented as illustrative, not benchmarked — so treat absolute savings as unproven until we shadow-run it. But the structural levers (free-first ordering + a shared cache on client-agnostic web facts) are real and aimed squarely at our most repeated spend.
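To make the two structural levers concrete, here is a minimal sketch of the cache-key scheme and a free-first lookup. It assumes only what the mechanism notes state (sha256 over a normalized URL or over `rung|normalized query`, with utm/gclid stripped). The exact normalization rules, the `search` function, and the in-memory `cache` dict are illustrative stand-ins, not the repo's code.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme/host, drop fragment, trailing slash, and tracking params.

    Assumption: only utm_* and gclid are stripped; the repo may strip more.
    """
    parts = urlsplit(url.strip())
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not (k.lower().startswith("utm_") or k.lower() == "gclid")
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), ""))


def normalize_query(query: str) -> str:
    """Case-fold and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


def url_key(url: str) -> str:
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def query_key(rung: str, query: str) -> str:
    return hashlib.sha256(f"{rung}|{normalize_query(query)}".encode("utf-8")).hexdigest()


def search(query, rungs, cache):
    """rungs: ordered (name, usd_per_call, fetch_fn) tuples, cheapest first.

    Returns (result, rung_name, usd_spent, cache_hit).
    """
    # Pass 1: a cached answer from any rung costs nothing.
    for name, _usd, _fetch in rungs:
        hit = cache.get(query_key(name, query))
        if hit is not None:
            return hit, name, 0.0, True
    # Pass 2: walk the ladder and stop at the first rung that returns something.
    spent = 0.0
    for name, usd, fetch in rungs:
        spent += usd
        result = fetch(query)
        if result:
            cache[query_key(name, query)] = result
            return result, name, spent, False
    return None, None, spent, False
```

Because the key carries no client or campaign identifier, a second campaign asking the same question about the same company resolves in pass 1 at $0, which is the repeat-spend lever the table ranks as largest.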
Net: they make individual facts more trustworthy and cheaper to obtain; we make the campaign more complete and self-improving. Complementary, not competitive — which is the case for integration.
gtm-research already advertises a "drop-in cached upgrade for gtm-pipeline." The same seam fits the Orchestrator: it becomes an optional research/enrichment provider behind our existing registry, sharing our database.
Two integration depths: (a) shadow — run gtm-research alongside an existing enrichment, compare verified output + cost, write nothing to the pipeline; (b) provider — register as a first-class enrichment spec whose verified fields flow into enrichments/global_companies. Start with (a).
Wrap the entity-research workflow as an enrichment spec (research-engine) returning our canonical enrichment shape; verified fields only are persisted.
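A minimal sketch of the "verified fields only" rule, assuming each research field arrives in one of the three documented states. The `ResearchField` type, the `to_enrichment` function, and the output shape are hypothetical; our canonical enrichment shape is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchField:
    value: str
    verified: bool
    source_url: str = ""
    note: str = ""  # "NOT FOUND" or "UNVERIFIED" for the two non-verified states


def to_enrichment(fields: dict) -> dict:
    """Split research output into persistable fields and held-back fields."""
    persisted, held_back = {}, {}
    for name, field in fields.items():
        # Persist only when the value is confirmed AND carries its source.
        if field.verified and field.value and field.source_url:
            persisted[name] = {"value": field.value, "source_url": field.source_url}
        else:
            held_back[name] = field.note or "UNVERIFIED"
    return {"provider": "research-engine", "fields": persisted, "held_back": held_back}
```

Keeping `held_back` in the return value (rather than discarding it) leaves room for whichever downstream policy is chosen for unverified data, without letting it reach campaign state by default.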
A run mode that executes research alongside the current provider, logs cost + verified-rate + field agreement, and persists nothing to campaign state.
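One way the shadow comparison could be scored per company is sketched below. The function name, the input shapes, and the whitespace/case normalization are assumptions for illustration; the function only returns a report and writes nothing.

```python
def _norm(value) -> str:
    """Case-fold and collapse whitespace before comparing field values."""
    return " ".join(str(value).casefold().split())


def compare_shadow(current: dict, research: dict, research_usd: float, current_usd: float) -> dict:
    """current: {field: value}; research: {field: {"value": ..., "verified": bool}}."""
    verified = {
        name: field["value"]
        for name, field in research.items()
        if field.get("verified") and field.get("value")
    }
    # Agreement is only meaningful where both sides produced a value.
    comparable = [name for name in verified if current.get(name)]
    agree = [name for name in comparable if _norm(verified[name]) == _norm(current[name])]
    return {
        "verified_rate": len(verified) / len(research) if research else 0.0,
        "field_agreement": len(agree) / len(comparable) if comparable else None,
        "disagreements": sorted(set(comparable) - set(agree)),
        "cost_delta_usd": research_usd - current_usd,
    }
```

Listing the disagreeing fields by name, not just the rate, makes it possible to spot whether mismatches cluster in one field class.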
Replace per-query psycopg2 connect/close with a pool. The current pattern (concurrent fan-out, one connection per write) re-creates the exact Railway proxy pool-saturation failure mode from the 2026-05-21 incident. Never run it against maglev.proxy.rlwy.net.
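A sketch of the pooled pattern, written against any pool object that exposes `getconn()` and `putconn()` (as `psycopg2.pool.ThreadedConnectionPool` does). The `BoundedPool` wrapper is hypothetical; the semaphore is there because psycopg2's pools raise when exhausted rather than waiting, which matters under concurrent fan-out.

```python
import threading
from contextlib import contextmanager


class BoundedPool:
    """Caps in-flight connections and makes callers wait instead of failing."""

    def __init__(self, pool, max_in_flight: int):
        self._pool = pool
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def cursor(self):
        with self._slots:  # blocks when max_in_flight connections are in use
            conn = self._pool.getconn()
            try:
                cur = conn.cursor()
                try:
                    yield cur
                    conn.commit()
                except Exception:
                    conn.rollback()
                    raise
                finally:
                    cur.close()
            finally:
                # Always hand the connection back, even after an error.
                self._pool.putconn(conn)


# Usage sketch (requires psycopg2 and a reachable database; not executed here):
#   from psycopg2.pool import ThreadedConnectionPool
#   pool = BoundedPool(ThreadedConnectionPool(1, 8, dsn), max_in_flight=8)
#   with pool.cursor() as cur:
#       cur.execute("SELECT 1")
```

The pool size of 8 in the usage comment is a placeholder; the right ceiling depends on the target Postgres instance.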
Point RESEARCH_DATABASE_URL at a directly-reachable Postgres (local olympus DB or a dedicated instance), honoring the dual-DB topology + read-only-Railway rules.
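A startup guard could enforce this rule instead of relying on convention. The sketch below assumes the only blocked host is the one named above; `resolve_research_dsn` and the `BLOCKED_HOSTS` set are hypothetical names.

```python
import os
from urllib.parse import urlsplit

# Assumption: only the proxy host named in this PRD is blocked; extend as needed.
BLOCKED_HOSTS = {"maglev.proxy.rlwy.net"}


def resolve_research_dsn(env=None) -> str:
    """Return RESEARCH_DATABASE_URL, refusing to start if it targets a blocked host."""
    env = os.environ if env is None else env
    dsn = env.get("RESEARCH_DATABASE_URL", "")
    if not dsn:
        raise RuntimeError("RESEARCH_DATABASE_URL is not set")
    host = (urlsplit(dsn).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        raise RuntimeError(f"refusing to run research against {host}")
    return dsn
```

Failing at startup keeps a misconfigured environment from ever opening a connection to the proxy.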
entity_telemetry_upsert accumulates counters — a retry double-counts credits/USD. Guard before trusting cost numbers.
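One possible guard is an idempotency ledger: each billable call gets a stable event id, and counters move only the first time that id is seen. The sketch uses SQLite from the standard library purely so it runs anywhere; the table names, `record_once`, and the event-id format are assumptions, and the real fix would live in the Postgres upsert.

```python
import sqlite3


def open_telemetry(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS telemetry_events (event_id TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS entity_telemetry ("
        "entity_id TEXT PRIMARY KEY, credits INTEGER NOT NULL DEFAULT 0, "
        "usd REAL NOT NULL DEFAULT 0)"
    )
    return db


def record_once(db, event_id: str, entity_id: str, credits: int, usd: float) -> bool:
    """Accumulate counters only if event_id is new. Returns True when counted."""
    with db:  # one transaction: ledger row and counter update commit together
        cur = db.execute(
            "INSERT OR IGNORE INTO telemetry_events (event_id) VALUES (?)", (event_id,)
        )
        if cur.rowcount == 0:
            return False  # retry of an event already counted
        db.execute("INSERT OR IGNORE INTO entity_telemetry (entity_id) VALUES (?)", (entity_id,))
        db.execute(
            "UPDATE entity_telemetry SET credits = credits + ?, usd = usd + ? "
            "WHERE entity_id = ?",
            (credits, usd, entity_id),
        )
    return True
```

The event id has to be derived from the call itself (for example run id, entity, rung, attempt-independent request hash) so that a retry reproduces the same id.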
Define how verified=false / unverified-guess fields are handled downstream (drop vs flag vs operator review) so unverified data never silently enters a campaign.
claude_cli "free" rung: Confirm whether the claude -p search subprocess consumes API credits / plan limits before relying on it as a free tier.
Route spider-llm / crunchbase scraping through the same shared cache + dead-URL negative cache to cut repeat fetches agency-wide.
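A sketch of what a scraper would call instead of fetching directly. It assumes the 90-day negative TTL for 401/404/410 from the mechanism notes and a 30-day positive TTL; the `SharedFetchCache` class, the in-memory store, and the `fetch` callable signature are hypothetical stand-ins for the shared Postgres cache.

```python
import time

POSITIVE_TTL_S = 30 * 86400   # assumed default; may need tuning per field class
NEGATIVE_TTL_S = 90 * 86400   # dead URLs are remembered for 90 days
DEAD_STATUSES = {401, 404, 410}


class SharedFetchCache:
    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch    # callable(url) -> (status, body)
        self._clock = clock    # injectable so expiry can be tested
        self._rows = {}        # url -> (status, body, expires_at)

    def get(self, url: str):
        """Return (status, body, cache_hit)."""
        now = self._clock()
        row = self._rows.get(url)
        if row is not None and row[2] > now:
            return row[0], row[1], True
        status, body = self._fetch(url)
        if status in DEAD_STATUSES:
            # Negative cache: do not pay to re-discover that a URL is dead.
            self._rows[url] = (status, None, now + NEGATIVE_TTL_S)
        elif status == 200:
            self._rows[url] = (status, body, now + POSITIVE_TTL_S)
        # Other statuses (e.g. 5xx) are treated as transient and not cached.
        return status, body, False
```

In practice the URL would be normalized before lookup so that tracking-parameter variants share one row.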
The closed predicate grammar (no eval, unknown atom → skip) is a clean pattern for Olympus skill/provider dispatch generally.
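The pattern can be illustrated with a small evaluator. The atom names, the `and`/`not` syntax, and `should_run` are invented for this sketch; the repo's actual grammar is not reproduced here. What carries over is the shape: a fixed table of known atoms, no `eval`, and any unrecognized atom causing the step to be skipped.

```python
# Closed set of atoms: the only things a predicate is allowed to ask.
ATOMS = {
    "has_domain": lambda ctx: bool(ctx.get("domain")),
    "cache_miss": lambda ctx: not ctx.get("cache_hit", False),
    "budget_left": lambda ctx: ctx.get("budget_usd", 0) > 0,
}


def should_run(predicate: str, ctx: dict) -> bool:
    """Evaluate 'atom and not atom and ...' against ctx without using eval."""
    for term in predicate.split(" and "):
        term = term.strip()
        negate = term.startswith("not ")
        name = term[4:].strip() if negate else term
        atom = ATOMS.get(name)
        if atom is None:
            return False  # unknown atom: skip the step rather than guess
        if atom(ctx) == negate:
            return False  # term evaluated false
    return True
```

Because predicates are only ever looked up in `ATOMS`, a malformed or hostile string in config cannot execute code; the worst outcome is a skipped step.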
Stand up gtm-research against the olympus DB (pooled). Pick 1–2 live campaigns; run research in shadow on the same companies. Measure: cost/company, cache hit-rate after warm-up, verified-rate, and field agreement vs current enrichment. Decision gate: real savings + acceptable accuracy?
Add connection pooling, fix telemetry double-count, add a minimal test suite around cache hit/miss + waterfall predicate + verify-pass. Prereq for any write path.
Register research-engine as an enrichment spec; verified fields flow into global_companies/enrichments. Gate behind a brief flag; keep current providers as fallback (waterfall).
Route signal-bank deep-enrichment and standalone scrapers through the shared cache. Agency-wide repeat-fetch savings.
| Risk | Severity | Mitigation |
|---|---|---|
| Zero automated tests in the repo | High | Add coverage in Phase 2 before any write path; ship behind a flag |
| No resume/checkpoint on long runs | Med | Cache makes re-runs cheap; add entity-level skip on re-invoke |
| Telemetry idempotency (double-count on retry) | Med | FR-3; don't trust cost numbers until fixed |
| Shared cache = stale facts | Med | 30d TTL is sane for firmographics; tune per field class; dead-URL negative cache already handled |
| "Free" claude_cli rung may bill | Low–Med | NFR-3; can disable that rung and lean on ddg/brave/serper |
| Savings unproven (illustrative numbers) | Low | Phase 1 shadow eval produces real figures before commitment |
research schema in the local olympus DB, or stand up a dedicated Postgres? (Affects cross-client cache reach + the read-only-Railway constraint.)
unverified-guess, do we drop it, flag it for operator review, or allow it with a confidence marker?
Verification provenance. kkrlstrm/gtm-research was shallow-cloned and read in full (commit 479f65c); the Orchestrator's research/enrichment was read read-only on this machine. Cost/outcome claims are derived from source mechanics, not benchmarks; the repo's own savings figures are documented as illustrative, which is why Phase 1 is a measured shadow eval. The Railway pooling risk is cross-referenced to the documented 2026-05-21 incident.
LeadGrow GTM — research integration PRD · 2026-06-12