Product Requirements — gtm-research → GTM Orchestrator

A source-verified comparison of kkrlstrm/gtm-research against LeadGrow's current research & enrichment execution — on cost and on outcome — and a concrete plan for whether/how to integrate it to enhance the Orchestrator. Verified against cloned source on 2026-06-12.

gtm-research LOC: ~2.5k
Search rungs: 4 free→gated
Cache TTL: 30d / 90d dead
Verify pass: Yes
Test suite: None
Conn pooling: None

01 Problem statement

LeadGrow's research/enrichment is paid-API-first and stateless across runs at the web-content level: every campaign re-fetches and re-pays for facts about the same companies, and enrichment outputs aren't structurally verified against a source. We want to know (1) whether gtm-research's free-first, cached, verified approach materially cuts cost and raises accuracy, and (2) whether it can plug into the Orchestrator without destabilizing it.

02 Current state: how we execute research today

● LeadGrow research / enrichment

  • Rich enrichment: Clay CLI, AI Ark people-find, email waterfall (IcyPeas → LeadMagic → Kitt), MillionVerifier (stages/enrichments · waterfall.py)
  • 12-spec enrichment registry (clay | runtime LLM | apify); per-stage + per-lead cost attribution (Python)
  • Signal-bank deep-enrichment + Nexus cross-run memory feeding copy
  • No free-first web-search waterfall — enrichment goes to paid providers/Clay directly
  • No shared web-content cache — facts about a company are re-fetched per campaign, per client
  • No structural source-verify pass — enrichment LLM output isn't re-checked against the cited URL
  • Standalone scrapers (spider-llm, crunchbase-scraper, custom) live outside the pipeline, uncached

● gtm-research

  • Free-first cost waterfall (search + fetch), fail-closed by config (config/research-waterfall.yaml · waterfall.py)
  • Shared Postgres cache, client-agnostic, 30d TTL + 90d dead-URL negative cache
  • Two-agent verify pass + three-state field discipline (verified / not-found / unverified-guess)
  • Per-rung telemetry + cost-split views; watchdog pauses if verified-rate < 0.55
  • Enrichment depth is shallow vs ours — it's a research engine, not a full enrichment/launch stack

Scope note: our standalone scraper repos (spider-llm-scraper, crunchbase-scraper, etc.) were not deep-read for this PRD — the comparison target is the Orchestrator's in-pipeline research/enrichment, which is where integration would land.

03 gtm-research, verified

The mechanism, confirmed against source — this is what we'd actually be adopting.

Cost waterfall — two independent chains, stop at first success

# SEARCH waterfall · config/research-waterfall.yaml:56-88
keyword    (free: ddg → brave → claude_cli → serper)
exa        (free, semantic only)
tavily     (1 credit)
parallel   (~$0.005, gated, off by default)

# FETCH waterfall · config/research-waterfall.yaml:90-122
native     http
jina       r.jina.ai (JS-shell fallback)
tavily     extract
parallel
+ digest-compress >8k chars (cheap model)

# fail-closed: unknown predicate atom → SKIP. a YAML typo can NEVER fire a paid rung.
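The fail-closed behaviour can be illustrated with a minimal Python sketch. This is not the repo's code: the atom names (`always`, `paid_enabled`), the rung dictionaries, and the rung ordering in the demo are hypothetical, chosen only to show that a misspelled predicate disables a rung rather than firing it.

```python
# Hypothetical closed set of predicate atoms. The real grammar lives in
# config/research-waterfall.yaml and may look different.
KNOWN_ATOMS = {
    "always": lambda ctx: True,
    "paid_enabled": lambda ctx: bool(ctx.get("paid_enabled", False)),
}

def rung_allowed(predicate, ctx):
    """Fail closed: any atom outside KNOWN_ATOMS disables the rung."""
    check = KNOWN_ATOMS.get(predicate)
    return check(ctx) if check is not None else False

def run_waterfall(rungs, query, ctx):
    """Try rungs in order; return (rung name, result) at the first success."""
    for rung in rungs:
        if not rung_allowed(rung["when"], ctx):
            continue  # skipped: the provider is never called
        result = rung["call"](query)
        if result:  # None or empty counts as a miss, so fall through
            return rung["name"], result
    return None

calls = []  # records which providers were actually invoked

def ddg_miss(query):
    calls.append("ddg")
    return None

def parallel_paid(query):
    calls.append("parallel")
    return ["paid-hit"]

def exa_free(query):
    calls.append("exa")
    return ["free-hit"]

rungs = [
    {"name": "ddg", "when": "always", "call": ddg_miss},
    # "paid_enabeld" is misspelled on purpose: the rung must be skipped
    {"name": "parallel", "when": "paid_enabeld", "call": parallel_paid},
    {"name": "exa", "when": "always", "call": exa_free},
]
outcome = run_waterfall(rungs, "acme corp", {"paid_enabled": True})
```

Even with `paid_enabled` set to true, the typo keeps the paid rung from running, and the waterfall stops at the first rung that returns a result.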

Shared cache + verify pass
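
As a rough illustration of the cache semantics summarized in section 02 (30d TTL on content, 90d negative cache for dead URLs, keyed without any client identifier), here is a minimal in-memory sketch. The `WebCache` class and its return values are hypothetical; the actual engine stores this in Postgres and its schema is not reproduced here.

```python
from datetime import datetime, timedelta, timezone

LIVE_TTL = timedelta(days=30)  # from the PRD: 30d TTL on fetched content
DEAD_TTL = timedelta(days=90)  # from the PRD: 90d negative cache for dead URLs

class WebCache:
    """In-memory stand-in for the shared Postgres cache (hypothetical shape)."""

    def __init__(self):
        self._rows = {}  # url -> (content or None, stored_at); no client key

    def put(self, url, content, now):
        # content=None records a dead URL (negative cache entry)
        self._rows[url] = (content, now)

    def get(self, url, now):
        """Return ('hit', content), ('dead', None), or ('miss', None)."""
        row = self._rows.get(url)
        if row is None:
            return ("miss", None)
        content, stored_at = row
        ttl = DEAD_TTL if content is None else LIVE_TTL
        if now - stored_at > ttl:
            return ("miss", None)  # expired: the caller should re-fetch
        return ("dead", None) if content is None else ("hit", content)

t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
cache = WebCache()
cache.put("https://example.com/about", "<html>About</html>", t0)
cache.put("https://example.com/gone", None, t0)
```

Because the key is the URL alone, a second client researching the same company within the TTL gets a hit at no cost, and a known-dead URL is not retried for 90 days.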

04 Cost comparison

| Lever | LeadGrow today | gtm-research | Delta |
| --- | --- | --- | --- |
| First-touch search | Paid provider / Clay credits | Free tiers first (ddg/brave/serper ~free quota) | Large on high-volume discovery |
| Repeat facts across runs/clients | Re-fetched, re-paid every campaign | Shared cache, $0 on hit (client-agnostic) | Largest single lever |
| Page fetch | Provider/scraper per call | native → jina (free) before any paid extract | Material |
| Long pages | Full content into context | digest-compress on a cheap model | Token savings |
| Cost visibility | Python rate-cards; TS path reports $0 (bug) | Per-rung credits + USD + cost-split views | They're more honest at runtime |

The repo's own "88% free / 11% tavily / 1% parallel" split is documented as illustrative, not benchmarked — so treat absolute savings as unproven until we shadow-run it. But the structural levers (free-first ordering + a shared cache on client-agnostic web facts) are real and aimed squarely at our most repeated spend.

05 Outcome comparison

Where gtm-research raises outcome quality

  • Source-verified facts. The verify-pass + three-state discipline structurally prevents LLM-guessed values from entering output as facts — directly relevant to enrichment fields we currently trust unchecked.
  • Watchdog catches silent degradation (verified-rate collapse) mid-run — we have no equivalent.
  • Primary-source bias (entity's own site over aggregators) → cleaner firmographics.
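
The watchdog idea can be sketched in a few lines of Python. The 0.55 threshold and the three state names come from this document; everything else is an assumption, including the `min_samples` warm-up and the choice to count not-found fields in the denominator, since the repo's exact formula is not reproduced here.

```python
from dataclasses import dataclass

# Three-state field discipline, as named in this PRD
VERIFIED = "verified"
NOT_FOUND = "not-found"
UNVERIFIED_GUESS = "unverified-guess"

@dataclass
class Watchdog:
    threshold: float = 0.55  # from the PRD: pause below this verified-rate
    min_samples: int = 20    # assumption: do not trip on the first few fields
    verified: int = 0
    total: int = 0

    def record(self, state: str) -> None:
        """Count one researched field by its verification state."""
        self.total += 1
        if state == VERIFIED:
            self.verified += 1

    @property
    def verified_rate(self) -> float:
        # Assumption: every recorded field counts toward the denominator
        return self.verified / self.total if self.total else 1.0

    def should_pause(self) -> bool:
        return self.total >= self.min_samples and self.verified_rate < self.threshold

wd = Watchdog()
for state in [VERIFIED] * 10 + [UNVERIFIED_GUESS] * 8 + [NOT_FOUND] * 2:
    wd.record(state)
```

With 10 of 20 fields verified the rate is 0.50, so `wd.should_pause()` is true and a run would halt before more unverified values accumulate.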

Where we still win on outcome

  • Enrichment breadth + people/email/phone waterfalls far exceed gtm-research's scope.
  • Nexus cross-run memory already improves copy from past performance — gtm-research has no learning loop.
  • Row-conservation + scorecard guard list integrity downstream of research.

Net: they make individual facts more trustworthy and cheaper to obtain; we make the campaign more complete and self-improving. Complementary, not competitive — which is the case for integration.

06 Can they integrate? Yes, as a cached research provider

gtm-research already advertises a "drop-in cached upgrade for gtm-pipeline." The same seam fits the Orchestrator: it becomes an optional research/enrichment provider behind our existing registry, sharing our database.

# proposed wiring
Orchestrator enrich stage
  enrichment registry    provider: "research-engine"
    gtm-research entity-research workflow (free-first waterfall + shared cache + verify)
      returns {fields, rows[ {value, source_url, verified, note} ], summary}
        map verified rows into global_companies / enrichments (Supabase)

# DSN: RESEARCH_DATABASE_URL → falls back to DATABASE_URL → research schema co-locates
# cache is shared across every client + campaign that points at the same DB

Two integration depths: (a) shadow — run gtm-research alongside an existing enrichment, compare verified output + cost, write nothing to the pipeline; (b) provider — register as a first-class enrichment spec whose verified fields flow into enrichments/global_companies. Start with (a).

07 Requirements

FR-1 · Must · Cached research provider behind the registry

Wrap the entity-research workflow as an enrichment spec (research-engine) returning our canonical enrichment shape; verified fields only are persisted.

FR-2 · Must · Shadow mode first

A run mode that executes research alongside the current provider, logs cost + verified-rate + field agreement, and persists nothing to campaign state.
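
One way to compute the field-agreement metric is sketched below. The row shape (`value`, `source_url`, `verified`, `note`) follows the return shape in the proposed wiring; the `field_agreement` helper, the normalization rule, and the sample field names are assumptions for illustration.

```python
def field_agreement(current, shadow_rows):
    """Compare current enrichment values with verified shadow values, per field."""
    def norm(value):
        return str(value).strip().lower()

    compared = 0
    agree = 0
    disagreements = []
    for field, row in shadow_rows.items():
        # Only verified shadow values that we also hold today are comparable
        if not row.get("verified") or field not in current:
            continue
        compared += 1
        if norm(current[field]) == norm(row["value"]):
            agree += 1
        else:
            disagreements.append(field)
    rate = agree / compared if compared else None
    return {"compared": compared, "agreement": rate, "disagreements": disagreements}

# Hypothetical sample: what the current provider holds vs. the shadow run
current = {"hq_city": "Austin", "employees": "250", "industry": "SaaS"}
shadow = {
    "hq_city": {"value": "austin ", "source_url": "https://example.com/about",
                "verified": True, "note": ""},
    "employees": {"value": "300", "source_url": "https://example.com/team",
                  "verified": True, "note": ""},
    "industry": {"value": "Software", "source_url": "",
                 "verified": False, "note": "guess"},
}
report = field_agreement(current, shadow)
```

Here two fields are comparable, one agrees, and `employees` is listed as a disagreement; the unverified `industry` row is excluded rather than counted against either side. Nothing in this path writes to campaign state.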

NFR-1 · Must · Connection pooling before any prod path

Replace per-query psycopg2 connect/close with a pool. The current pattern (concurrent fan-out, one connection per write) re-creates the exact Railway proxy pool-saturation failure mode from the 2026-05-21 incident. Never run it against maglev.proxy.rlwy.net.
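
The shape of the fix is a bounded pool: concurrent writers borrow from a fixed set of connections and block when it is exhausted, instead of each opening its own. The sketch below uses only the standard library, with sqlite3 standing in for Postgres so it runs anywhere; `BoundedPool` is a hypothetical name. In the real engine, psycopg2's own `psycopg2.pool.ThreadedConnectionPool` is one candidate for the same role.

```python
import queue
import sqlite3
from contextlib import contextmanager

class BoundedPool:
    """Fixed-size pool: borrowers wait for a free connection, never open new ones."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Raises queue.Empty if every connection stays busy past the timeout
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back

# sqlite3 is a stand-in here; a Postgres connect function would be passed instead
pool = BoundedPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False),
    size=2,
)

with pool.connection() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]
```

The pool size caps the number of connections the fan-out can hold open, which is the property the current connect-per-write pattern lacks.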

NFR-2 · Must · Cache on the Olympus/Orchestrator DB, not the Railway proxy

Point RESEARCH_DATABASE_URL at a directly-reachable Postgres (local olympus DB or a dedicated instance), honoring the dual-DB topology + read-only-Railway rules.
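
A small guard at startup could enforce this. The sketch below resolves the DSN using the fallback order stated in the proposed wiring (RESEARCH_DATABASE_URL, then DATABASE_URL) and refuses the proxy host named in NFR-1; the function name `resolve_research_dsn` and the example DSNs are hypothetical.

```python
from urllib.parse import urlparse

# From NFR-1: the research cache must never target the Railway proxy
BLOCKED_HOSTS = {"maglev.proxy.rlwy.net"}

def resolve_research_dsn(env):
    """Pick the research DSN, preferring RESEARCH_DATABASE_URL over DATABASE_URL."""
    dsn = env.get("RESEARCH_DATABASE_URL") or env.get("DATABASE_URL")
    if not dsn:
        raise RuntimeError("set RESEARCH_DATABASE_URL or DATABASE_URL")
    host = (urlparse(dsn).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        raise RuntimeError(f"refusing to use proxy host {host} for the research cache")
    return dsn

# Hypothetical local DSN for illustration
local_env = {"DATABASE_URL": "postgresql://user:pw@localhost:5432/olympus"}
dsn = resolve_research_dsn(local_env)
```

In production this would be called with `os.environ`; passing the environment as a plain mapping keeps the check easy to test.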

FR-3 · Should · Fix telemetry idempotency

entity_telemetry_upsert accumulates counters — a retry double-counts credits/USD. Guard before trusting cost numbers.
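
One possible guard is to stop accumulating counters in place and instead write each usage event once under an idempotency key, deriving totals from that ledger. The sketch below uses sqlite3 so it is runnable anywhere; the table name, columns, and `event_id` format are hypothetical, and this is not how `entity_telemetry_upsert` is currently written. In Postgres the equivalent insert would use `ON CONFLICT DO NOTHING`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE telemetry_events (
           event_id TEXT PRIMARY KEY,  -- hypothetical idempotency key
           entity   TEXT NOT NULL,
           rung     TEXT NOT NULL,
           credits  INTEGER NOT NULL,
           usd      REAL NOT NULL
       )"""
)

def record_usage(conn, event_id, entity, rung, credits, usd):
    """Write one usage event; a retry with the same event_id is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO telemetry_events VALUES (?, ?, ?, ?, ?)",
        (event_id, entity, rung, credits, usd),
    )
    conn.commit()

def totals(conn, entity):
    """Derive (credits, usd) from the ledger instead of mutating counters."""
    return conn.execute(
        "SELECT COALESCE(SUM(credits), 0), COALESCE(SUM(usd), 0.0) "
        "FROM telemetry_events WHERE entity = ?",
        (entity,),
    ).fetchone()

record_usage(conn, "run42:acme:tavily:1", "acme", "tavily", 1, 0.0)
record_usage(conn, "run42:acme:tavily:1", "acme", "tavily", 1, 0.0)  # retry
record_usage(conn, "run42:acme:parallel:1", "acme", "parallel", 0, 0.005)
credits_total, usd_total = totals(conn, "acme")
```

The retried Tavily event is stored once, so the totals reflect one credit rather than two.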

FR-4 · Should · Verified-field mapping contract

Define how verified=false / unverified-guess fields are handled downstream (drop vs flag vs operator review) so unverified data never silently enters a campaign.
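
To make the three options concrete, the contract could be expressed as an explicit policy applied at the mapping step. This is a sketch under assumptions: the `Policy` enum, the `apply_policy` helper, and the `unverified` marker are hypothetical, and the choice between them is still open decision 2 in section 10.

```python
from enum import Enum

class Policy(Enum):
    DROP = "drop"      # discard unverified values
    FLAG = "flag"      # persist, but with an explicit marker
    REVIEW = "review"  # hold for operator review, persist nothing

def apply_policy(rows, policy):
    """Split research rows into persisted fields and an operator-review queue."""
    persisted = {}
    review_queue = []
    for field, row in rows.items():
        if row.get("value") is None:
            continue  # not-found: nothing to persist
        entry = {"value": row["value"], "source_url": row["source_url"]}
        if row["verified"]:
            persisted[field] = entry
        elif policy is Policy.FLAG:
            entry["unverified"] = True
            persisted[field] = entry
        elif policy is Policy.REVIEW:
            review_queue.append(field)
        # Policy.DROP: the unverified value goes nowhere
    return persisted, review_queue

# Hypothetical rows in the {value, source_url, verified, note} shape
rows = {
    "hq_city": {"value": "Austin", "source_url": "https://example.com/about",
                "verified": True, "note": ""},
    "employees": {"value": "300", "source_url": "",
                  "verified": False, "note": "guess"},
    "founded": {"value": None, "source_url": "",
                "verified": False, "note": "not found"},
}
dropped, _ = apply_policy(rows, Policy.DROP)
flagged, _ = apply_policy(rows, Policy.FLAG)
_, queued = apply_policy(rows, Policy.REVIEW)
```

Under every policy the verified `hq_city` is persisted and the not-found `founded` is skipped; only the handling of the guessed `employees` value changes.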

NFR-3 · Should · Resolve the claude_cli "free" rung

Confirm whether the claude -p search subprocess consumes API credits / plan limits before relying on it as a free tier.

FR-5 · Could · Reuse the cache for standalone scrapers

Route spider-llm / crunchbase scraping through the same shared cache + dead-URL negative cache to cut repeat fetches agency-wide.

FR-6 · Could · Adopt the fail-closed waterfall grammar elsewhere

The closed predicate grammar (no eval, unknown atom → skip) is a clean pattern for Olympus skill/provider dispatch generally.

08 Phased rollout

1. Shadow eval (~1 wk)

Stand up gtm-research against the olympus DB (pooled). Pick 1–2 live campaigns; run research in shadow on the same companies. Measure: cost/company, cache hit-rate after warm-up, verified-rate, and field agreement vs current enrichment. Decision gate: real savings + acceptable accuracy?

2. Pooling + idempotency hardening (~3–4 d)

Add connection pooling, fix telemetry double-count, add a minimal test suite around cache hit/miss + waterfall predicate + verify-pass. Prereq for any write path.

3. Provider registration (~1 wk)

Register research-engine as an enrichment spec; verified fields flow into global_companies/enrichments. Gate behind a brief flag; keep current providers as fallback (waterfall).

4. Cache-sharing expansion (later)

Route signal-bank deep-enrichment and standalone scrapers through the shared cache. Agency-wide repeat-fetch savings.

09 Risks & blockers

| Risk | Severity | Mitigation |
| --- | --- | --- |
| Zero automated tests in the repo | High | Add coverage in Phase 2 before any write path; ship behind a flag |
| No resume/checkpoint on long runs | Med | Cache makes re-runs cheap; add entity-level skip on re-invoke |
| Telemetry idempotency (double-count on retry) | Med | FR-3; don't trust cost numbers until fixed |
| Shared cache = stale facts | Med | 30d TTL is sane for firmographics; tune per field class; dead-URL negative cache already handled |
| "Free" claude_cli rung may bill | Low–Med | NFR-3; can disable that rung and lean on ddg/brave/serper |
| Savings unproven (illustrative numbers) | Low | Phase 1 shadow eval produces real figures before commitment |

10 Open decisions for Mitchell

1. DB home for the shared cache. Co-locate the research schema in the local olympus DB, or stand up a dedicated Postgres? (Affects cross-client cache reach + the read-only-Railway constraint.)
2. Unverified-field policy. When a fact comes back unverified-guess, do we drop it, flag it for operator review, or allow it with a confidence marker?
3. Build vs. adopt the engine. Wrap kkrlstrm/gtm-research directly (and own the hardening), or re-implement the three patterns natively inside the Orchestrator's enrichment core? (Adopt is faster; native is cleaner long-term and avoids inheriting the un-tested code.)
4. Scope of cache sharing. Research-only, or also route standalone scrapers + signal-bank through it for agency-wide savings?

Verification provenance. kkrlstrm/gtm-research was shallow-cloned and read in full (commit 479f65c); the Orchestrator's research/enrichment code was inspected in read-only mode on this machine. Cost/outcome claims are derived from source mechanics, not benchmarks; the repo's own savings figures are documented as illustrative, which is why Phase 1 is a measured shadow eval. The Railway pooling risk is cross-referenced to the documented 2026-05-21 incident.

LeadGrow GTM — research integration PRD · 2026-06-12