Product Requirements — gtm-research → GTM Orchestrator

A source-verified comparison of kkrlstrm/gtm-research against LeadGrow's current research & enrichment execution — on cost and on outcome — and a concrete plan for whether/how to integrate it to enhance the Orchestrator. Verified against cloned source on 2026-06-12.

gtm-research LOC: ~2.5k
Search rungs: 4 free→gated
Cache TTL: 30d / 90d dead
Verify pass: Yes
Test suite: None
Conn pooling: None

Recommendation

Adopt the patterns, not the repo as-is. gtm-research is a genuinely well-designed cost-optimized research layer whose three ideas — a fail-closed free-first cost waterfall, a client-agnostic shared web cache, and a two-agent source-verify pass — directly attack our biggest research weaknesses (no web-content cache, paid-API-first enrichment, unverified facts). But it ships with zero tests, no resume, and an un-pooled psycopg2 layer that would re-trigger our documented Railway pool-saturation incident. Plan: wrap it as an optional cached research provider behind our enrichment registry — Phase 1 in shadow mode, never on the Railway proxy — and harden pooling + idempotency before any production path. Do not drop it in wholesale.

01 Problem statement

LeadGrow's research/enrichment is paid-API-first and, at the web-content level, stateless across runs: every campaign re-fetches and re-pays for facts about the same companies, and enrichment outputs aren't structurally verified against a source. We want to know (1) whether gtm-research's free-first + cached + verified approach materially cuts cost and raises accuracy, and (2) whether it can plug into the Orchestrator without destabilizing it.

02 Current state: how we execute research today

● LeadGrow research / enrichment

Rich enrichment: Clay CLI, AI Ark people-find, email waterfall (IcyPeas → LeadMagic → Kitt), MillionVerifier · stages/enrichments · waterfall.py
12-spec enrichment registry (clay | runtime LLM | apify); per-stage + per-lead cost attribution (Python)
Signal-bank deep-enrichment + Nexus cross-run memory feeding copy
No free-first web-search waterfall — enrichment goes to paid providers/Clay directly
No shared web-content cache — facts about a company are re-fetched per campaign, per client
No structural source-verify pass — enrichment LLM output isn't re-checked against the cited URL
Standalone scrapers (spider-llm, crunchbase-scraper, custom) live outside the pipeline, uncached

● gtm-research

Free-first cost waterfall (search + fetch), fail-closed by config · config/research-waterfall.yaml · waterfall.py
Shared Postgres cache, client-agnostic, 30d TTL + 90d dead-URL negative cache
Two-agent verify pass + three-state field discipline (verified / not-found / unverified-guess)
Per-rung telemetry + cost-split views; watchdog pauses if verified-rate < 0.55
Enrichment depth is shallow vs ours — it's a research engine, not a full enrichment/launch stack

Scope note: our standalone scraper repos (spider-llm-scraper, crunchbase-scraper, etc.) were not deep-read for this PRD — the comparison target is the Orchestrator's in-pipeline research/enrichment, which is where integration would land.

03 gtm-research, verified

The mechanism, confirmed against source — this is what we'd actually be adopting.

Cost waterfall — two independent chains, stop at first success

# SEARCH waterfall (config/research-waterfall.yaml:56-88)
keyword (free: ddg → brave → claude_cli → serper)
  → exa (free, semantic only)
  → tavily (1 credit)
  → parallel (~$0.005, gated, off by default)

# FETCH waterfall (config/research-waterfall.yaml:90-122)
native http
  → jina r.jina.ai (JS-shell fallback)
  → tavily extract
  → parallel
  + digest-compress >8k chars (cheap model)

# fail-closed: unknown predicate atom → SKIP. A YAML typo can NEVER fire a paid rung.
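The fail-closed rule above can be illustrated with a small Python sketch. This is not the repo's waterfall.py: the atom names (`always`, `paid_enabled`), the `Rung` type, and the stub rungs are hypothetical, chosen only to show that a misspelled predicate disables a rung rather than firing it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Closed set of predicate atoms (hypothetical names). No eval, no dynamic lookup.
KNOWN_ATOMS: dict[str, Callable[[dict], bool]] = {
    "always": lambda ctx: True,
    "paid_enabled": lambda ctx: bool(ctx.get("paid_enabled", False)),
}


@dataclass
class Rung:
    name: str
    when: list[str]                       # every atom must hold (logical AND)
    run: Callable[[str], Optional[str]]   # returns a result, or None on a miss


def rung_allowed(rung: Rung, ctx: dict) -> bool:
    """Fail closed: an unknown atom disables the rung instead of raising or passing."""
    for atom in rung.when:
        check = KNOWN_ATOMS.get(atom)
        if check is None or not check(ctx):
            return False
    return True


def run_waterfall(rungs: list[Rung], query: str, ctx: dict) -> Optional[tuple[str, str]]:
    """Try rungs in order and stop at the first one that returns a result."""
    for rung in rungs:
        if not rung_allowed(rung, ctx):
            continue
        result = rung.run(query)
        if result is not None:
            return rung.name, result
    return None


# Stub rungs for illustration only; real rungs would call search providers.
RUNGS = [
    Rung("ddg", ["always"], lambda q: None),                    # free rung that misses
    Rung("parallel", ["paid_enabeld"], lambda q: f"paid:{q}"),  # typo in the atom name
    Rung("tavily", ["paid_enabled"], lambda q: f"tavily:{q}"),
]
```

With `{"paid_enabled": True}` the misspelled `parallel` rung is skipped and `tavily` answers; with an empty context no paid rung runs at all.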

Shared cache + verify pass

▪ Cache: Postgres research schema; keys are sha256(normalize_url) / sha256(rung|normalize_query) (strips utm/gclid). scope: shared: "web facts are client-agnostic, one cache across all runs." Dead URLs (401/404/410) negative-cached 90d. (research_db.py:110-162)
▪ Verify: a separate verify agent re-opens each source_url raw (--no-digest) and confirms or blanks every field. Three states: verified=true+source / ""+NOT FOUND / guess+UNVERIFIED. decide_what_to_fetch and verify_claim are flagged non-delegable to cheap models. (entity-research.js:127-188)
▪ Watchdog: trailing verified-rate < 0.55 over last 25 entities → run pauses (catches silent provider outages where pages load but facts don't verify).
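The cache keying described above can be sketched in Python using only the standard library. Only the sha256 key shapes, the utm/gclid stripping, and the 30d/90d TTLs come from the source; the other normalization choices here (lowercasing scheme and host, sorting parameters, dropping the fragment, collapsing whitespace in queries) are assumptions and may differ from research_db.py.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEAD_STATUSES = {401, 404, 410}   # negative-cached per the source
LIVE_TTL_DAYS = 30
DEAD_TTL_DAYS = 90


def normalize_url(url: str) -> str:
    """Strip tracking parameters so equivalent URLs share one cache entry."""
    parts = urlsplit(url.strip())
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_") and k.lower() != "gclid"
    ]
    kept.sort()  # assumption: parameter order is not significant
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


def url_cache_key(url: str) -> str:
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def normalize_query(query: str) -> str:
    # Assumption: case and repeated whitespace do not change search intent.
    return " ".join(query.lower().split())


def search_cache_key(rung: str, query: str) -> str:
    return hashlib.sha256(f"{rung}|{normalize_query(query)}".encode("utf-8")).hexdigest()


def ttl_days(http_status: int) -> int:
    """Dead URLs are remembered longer than live pages."""
    return DEAD_TTL_DAYS if http_status in DEAD_STATUSES else LIVE_TTL_DAYS
```

Because the rung name is part of the search key, the same query cached under one rung does not mask a miss on another.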

04 Cost comparison

Lever	LeadGrow today	gtm-research	Delta
First-touch search	Paid provider / Clay credits	Free tiers first (ddg/brave/serper ~free quota)	Large on high-volume discovery
Repeat facts across runs/clients	Re-fetched, re-paid every campaign	Shared cache, $0 on hit (client-agnostic)	Largest single lever
Page fetch	Provider/scraper per call	native → jina (free) before any paid extract	Material
Long pages	Full content into context	digest-compress on a cheap model	Token savings
Cost visibility	Python rate-cards; TS path reports $0 (bug)	Per-rung credits + USD + cost-split views	They're more honest at runtime

The repo's own "88% free / 11% tavily / 1% parallel" split is documented as illustrative, not benchmarked — so treat absolute savings as unproven until we shadow-run it. But the structural levers (free-first ordering + a shared cache on client-agnostic web facts) are real and aimed squarely at our most repeated spend.
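To make the two structural levers concrete, here is a back-of-envelope cost model in Python. It is an assumption-laden simplification, not something taken from the repo: it treats each entity as one lookup, charges nothing on a cache hit, and takes the per-rung share and unit price as inputs that the shadow eval would have to supply.

```python
def blended_cost_per_entity(
    cache_hit_rate: float,
    rungs: dict[str, tuple[float, float]],
) -> float:
    """Expected USD per entity.

    rungs maps a rung name to (share of cache misses resolved at that rung,
    USD per call at that rung). Shares must sum to 1.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    total_share = sum(share for share, _ in rungs.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("rung shares must sum to 1")
    # Cost of a cache miss is the share-weighted price across rungs.
    miss_cost = sum(share * usd for share, usd in rungs.values())
    # Cache hits cost nothing, so only the miss fraction is paid for.
    return (1.0 - cache_hit_rate) * miss_cost
```

Measured values for hit rate, rung split, and unit prices from Phase 1 would replace any placeholder inputs; the formula itself only shows that the cache and the free-first ordering multiply rather than add.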

05 Outcome comparison

Where gtm-research raises outcome quality

Source-verified facts. The verify-pass + three-state discipline structurally prevents LLM-guessed values from entering output as facts — directly relevant to enrichment fields we currently trust unchecked.
Watchdog catches silent degradation (verified-rate collapse) mid-run — we have no equivalent.
Primary-source bias (entity's own site over aggregators) → cleaner firmographics.
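The watchdog in the list above can be sketched as a rolling window over per-entity verified rates. The 25-entity window and the 0.55 threshold come from the source; averaging per-entity rates and waiting for a full window before judging are assumptions about details the source does not state.

```python
from collections import deque


class VerifiedRateWatchdog:
    """Signals a pause when the trailing verified-rate drops below a threshold."""

    def __init__(self, window: int = 25, threshold: float = 0.55) -> None:
        self.window = window
        self.threshold = threshold
        self._rates: deque[float] = deque(maxlen=window)  # oldest entries fall off

    def record(self, verified_fields: int, total_fields: int) -> bool:
        """Record one entity; return True if the run should pause."""
        rate = verified_fields / total_fields if total_fields else 0.0
        self._rates.append(rate)
        if len(self._rates) < self.window:
            return False  # assumption: do not judge until the window is full
        return sum(self._rates) / len(self._rates) < self.threshold
```

A caller would invoke `record` once per finished entity and stop scheduling new work when it returns True, which is how a provider outage that still returns pages would surface.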

Where we still win on outcome

Enrichment breadth + people/email/phone waterfalls far exceed gtm-research's scope.
Nexus cross-run memory already improves copy from past performance — gtm-research has no learning loop.
Row-conservation + scorecard guard list integrity downstream of research.

Net: they make individual facts more trustworthy and cheaper to obtain; we make the campaign more complete and self-improving. Complementary, not competitive — which is the case for integration.

06 Can they integrate? Yes, as a cached research provider

gtm-research already advertises a "drop-in cached upgrade for gtm-pipeline." The same seam fits the Orchestrator: it becomes an optional research/enrichment provider behind our existing registry, sharing our database.

# proposed wiring
Orchestrator enrich stage
  → enrichment registry → provider: "research-engine"
  → gtm-research entity-research workflow (free-first waterfall + shared cache + verify)
  → returns {fields, rows[ {value, source_url, verified, note} ], summary}
  → map verified rows into global_companies / enrichments (Supabase)
# DSN: RESEARCH_DATABASE_URL → falls back to DATABASE_URL → research schema co-locates
# cache is shared across every client + campaign that points at the same DB
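The two edges of this wiring, DSN resolution and the verified-only mapping, could look roughly like the Python sketch below. The function names are hypothetical; only the environment variable names and the `{value, source_url, verified, note}` row shape come from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_research_dsn(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Prefer RESEARCH_DATABASE_URL, then fall back to DATABASE_URL."""
    env = os.environ if env is None else env
    return env.get("RESEARCH_DATABASE_URL") or env.get("DATABASE_URL")


def verified_rows(result: dict) -> list[dict]:
    """Keep only rows that are explicitly verified and carry a source URL."""
    return [
        {"value": row["value"], "source_url": row["source_url"], "note": row.get("note", "")}
        for row in result.get("rows", [])
        if row.get("verified") is True and row.get("source_url")
    ]
```

Checking `verified is True` rather than truthiness keeps a string such as "unverified" from being mistaken for a pass.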

Two integration depths: (a) shadow — run gtm-research alongside an existing enrichment, compare verified output + cost, write nothing to the pipeline; (b) provider — register as a first-class enrichment spec whose verified fields flow into enrichments/global_companies. Start with (a).

07 Requirements

FR-1 · Must · Cached research provider behind the registry

Wrap the entity-research workflow as an enrichment spec (research-engine) returning our canonical enrichment shape; verified fields only are persisted.

FR-2 · Must · Shadow mode first

A run mode that executes research alongside the current provider, logs cost + verified-rate + field agreement, and persists nothing to campaign state.
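Field agreement is the least obvious of the shadow metrics, so here is one possible definition in Python. The comparison rule (case-insensitive, whitespace-collapsed string equality over fields both sides filled in) is an assumption; numeric or fuzzy matching may suit some field classes better.

```python
from typing import Any


def _norm(value: Any) -> str:
    return " ".join(str(value).lower().split())


def _present(value: Any) -> bool:
    return value is not None and str(value).strip() != ""


def field_agreement(current: dict, shadow: dict) -> dict:
    """Compare only fields that both providers actually filled in."""
    compared = sorted(
        k for k in current
        if k in shadow and _present(current[k]) and _present(shadow[k])
    )
    disagreements = [k for k in compared if _norm(current[k]) != _norm(shadow[k])]
    agreed = len(compared) - len(disagreements)
    return {
        "compared": len(compared),
        "agreed": agreed,
        "agreement": agreed / len(compared) if compared else None,
        "disagreements": disagreements,
    }
```

The function returns a report and writes nothing, which matches the requirement that shadow mode never touches campaign state.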

NFR-1 · Must · Connection pooling before any prod path

Replace per-query psycopg2 connect/close with a pool. The current pattern (concurrent fan-out, one connection per write) re-creates the exact Railway proxy pool-saturation failure mode from the 2026-05-21 incident. Never run it against maglev.proxy.rlwy.net.
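The shape of the fix is a bounded pool: concurrent writers wait for a free connection instead of opening a new one each. The sketch below uses a generic pool with sqlite3 as a stand-in so it runs without a database; `BoundedPool` is a hypothetical name, and a real change would more likely use psycopg2's own `psycopg2.pool.ThreadedConnectionPool` with `getconn()` / `putconn()`.

```python
import queue
import sqlite3
from contextlib import contextmanager
from typing import Callable, Iterator


class BoundedPool:
    """Fixed-size pool: at most `size` connections ever exist."""

    def __init__(self, connect: Callable[[], sqlite3.Connection], size: int) -> None:
        self._free: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(connect())

    @contextmanager
    def connection(self, timeout: float = 5.0) -> Iterator[sqlite3.Connection]:
        # Blocks (up to timeout) when the pool is exhausted; raises queue.Empty after.
        conn = self._free.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._free.put(conn)  # always hand the connection back


# Stand-in factory; a Postgres version would pass a psycopg2 connect callable.
pool = BoundedPool(lambda: sqlite3.connect(":memory:"), size=2)

with pool.connection() as conn:
    row = conn.execute("SELECT 1").fetchone()
```

This sketch omits health checks and rollback on error, both of which a production pool needs; its point is only that fan-out width no longer determines the number of open connections.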

NFR-2 · Must · Cache on the Olympus/Orchestrator DB, not the Railway proxy

Point RESEARCH_DATABASE_URL at a directly-reachable Postgres (local olympus DB or a dedicated instance), honoring the dual-DB topology + read-only-Railway rules.

FR-3 · Should · Fix telemetry idempotency

entity_telemetry_upsert accumulates counters — a retry double-counts credits/USD. Guard before trusting cost numbers.
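One way to guard against the double-count is to store one row per provider call under a unique key and derive totals by summing, so a retried write becomes a no-op. The Python sketch below demonstrates the idea with sqlite3 as a stand-in; the table, the `call_id` column, and the function names are hypothetical and do not reflect the actual signature of entity_telemetry_upsert. On Postgres the equivalent statement would be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE telemetry_events (
    entity_id TEXT NOT NULL,
    rung      TEXT NOT NULL,
    call_id   TEXT NOT NULL,   -- hypothetical: stable id generated once per provider call
    credits   REAL NOT NULL,
    usd       REAL NOT NULL,
    PRIMARY KEY (entity_id, rung, call_id)
)
"""


def record_call(conn: sqlite3.Connection, entity_id: str, rung: str,
                call_id: str, credits: float, usd: float) -> bool:
    """Insert once per call_id; a retry with the same id changes nothing."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO telemetry_events VALUES (?, ?, ?, ?, ?)",
        (entity_id, rung, call_id, credits, usd),
    )
    conn.commit()
    return cur.rowcount == 1  # False means the event was already recorded


def totals(conn: sqlite3.Connection, entity_id: str) -> tuple[float, float]:
    """Totals are derived by summing events, never by incrementing a counter."""
    row = conn.execute(
        "SELECT COALESCE(SUM(credits), 0), COALESCE(SUM(usd), 0) "
        "FROM telemetry_events WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
    return float(row[0]), float(row[1])
```

The key must be generated before the first attempt and reused on retry; generating a fresh id per attempt would reintroduce the double-count.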

FR-4 · Should · Verified-field mapping contract

Define how verified=false / unverified-guess fields are handled downstream (drop vs flag vs operator review) so unverified data never silently enters a campaign.
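Since the handling of unverified fields is still an open decision, a contract could make the policy an explicit parameter rather than a default. The Python sketch below is one hypothetical shape: the three states mirror gtm-research's field discipline, while the policy names, the bucket names, and the `confidence` marker are assumptions.

```python
from enum import Enum


class FieldState(Enum):
    VERIFIED = "verified"
    NOT_FOUND = "not-found"
    UNVERIFIED_GUESS = "unverified-guess"


class GuessPolicy(Enum):
    DROP = "drop"
    FLAG = "flag"
    REVIEW = "review"


def classify(row: dict) -> FieldState:
    value = str(row.get("value") or "").strip()
    if not value:
        return FieldState.NOT_FOUND
    if row.get("verified") is True and row.get("source_url"):
        return FieldState.VERIFIED
    return FieldState.UNVERIFIED_GUESS


def route_rows(rows: list[dict], policy: GuessPolicy) -> dict[str, list[dict]]:
    """Sort rows into buckets; only 'persist' may be written as fact."""
    routed: dict[str, list[dict]] = {"persist": [], "flagged": [], "review": [], "dropped": []}
    for row in rows:
        state = classify(row)
        if state is FieldState.VERIFIED:
            routed["persist"].append(row)
        elif state is FieldState.NOT_FOUND:
            routed["dropped"].append(row)
        elif policy is GuessPolicy.FLAG:
            routed["flagged"].append({**row, "confidence": "unverified"})
        elif policy is GuessPolicy.REVIEW:
            routed["review"].append(row)
        else:
            routed["dropped"].append(row)
    return routed
```

Whatever policy is chosen, a guess never lands in the `persist` bucket, so unverified data cannot enter a campaign silently.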

NFR-3 · Should · Resolve the claude_cli "free" rung

Confirm whether the claude -p search subprocess consumes API credits / plan limits before relying on it as a free tier.

FR-5 · Could · Reuse the cache for standalone scrapers

Route spider-llm / crunchbase scraping through the same shared cache + dead-URL negative cache to cut repeat fetches agency-wide.

FR-6 · Could · Adopt the fail-closed waterfall grammar elsewhere

The closed predicate grammar (no eval, unknown atom → skip) is a clean pattern for Olympus skill/provider dispatch generally.

08 Phased rollout

Phase 1 · Shadow eval · ~1 wk

Stand up gtm-research against the olympus DB (pooled). Pick 1–2 live campaigns; run research in shadow on the same companies. Measure: cost/company, cache hit-rate after warm-up, verified-rate, and field agreement vs current enrichment. Decision gate: real savings + acceptable accuracy?

Phase 2 · Pooling + idempotency hardening · ~3–4 d

Add connection pooling, fix telemetry double-count, add a minimal test suite around cache hit/miss + waterfall predicate + verify-pass. Prereq for any write path.

Phase 3 · Provider registration · ~1 wk

Register research-engine as an enrichment spec; verified fields flow into global_companies/enrichments. Gate behind a brief flag; keep current providers as fallback (waterfall).

Phase 4 · Cache-sharing expansion · later

Route signal-bank deep-enrichment and standalone scrapers through the shared cache. Agency-wide repeat-fetch savings.

09 Risks & blockers

Risk	Severity	Mitigation
Zero automated tests in the repo	High	Add coverage in Phase 2 before any write path; ship behind a flag
No resume/checkpoint on long runs	Med	Cache makes re-runs cheap; add entity-level skip on re-invoke
Telemetry idempotency (double-count on retry)	Med	FR-3; don't trust cost numbers until fixed
Shared cache = stale facts	Med	30d TTL is sane for firmographics; tune per field class; dead-URL negative cache already handled
"Free" claude_cli rung may bill	Low–Med	NFR-3; can disable that rung and lean on ddg/brave/serper
Savings unproven (illustrative numbers)	Low	Phase 1 shadow eval produces real figures before commitment

10 Open decisions for Mitchell

1. DB home for the shared cache. Co-locate the research schema in the local olympus DB, or stand up a dedicated Postgres? (Affects cross-client cache reach + the read-only-Railway constraint.)

2. Unverified-field policy. When a fact comes back unverified-guess, do we drop it, flag it for operator review, or allow it with a confidence marker?

3. Build vs. adopt the engine. Wrap kkrlstrm/gtm-research directly (and own the hardening), or re-implement the three patterns natively inside the Orchestrator's enrichment core? (Adopt is faster; native is cleaner long-term and avoids inheriting the un-tested code.)

4. Scope of cache sharing. Research-only, or also route standalone scrapers + signal-bank through it for agency-wide savings?

Verification provenance. kkrlstrm/gtm-research was shallow-cloned and read in full (commit 479f65c); the Orchestrator's research/enrichment code was inspected read-only on this machine. Cost/outcome claims are derived from source mechanics, not benchmarks; the repo's own savings figures are documented as illustrative, which is why Phase 1 is a measured shadow eval. The Railway pooling risk is cross-referenced to the documented 2026-05-21 incident.

LeadGrow GTM — research integration PRD · 2026-06-12