Skip to main content

BM25 Dynamic Search vs Pre-Indexed Documentation

Comparison of two approaches for AI V4 query resolution.


Test Results Summary

QueryPre-Indexed ApproachBM25 Dynamic Search
"designs with specifications"✅ Found service relation✅ Found design.ts model with hasMany relation (Score: 19.08)
"partners have feedback"✅ Found module link✅ Found feedback workflow (Score: 28.47)
"visual flows with executions"⚠️ Missing (VisualFlow not indexed)✅ Found executions route + model (Score: 101.95)
"production runs for design"✅ Found FK reference✅ Found create-production-run workflow (Score: 66.41)

Approach 1: Pre-Indexed Documentation

Files:

  • specs/relations/service-relations.json (35 models, 60 relations)
  • specs/links/module-links.json (31 links)

Pros:

  • ⚡ Fast O(1) lookup
  • 📋 Structured data ready for LLM consumption
  • 🎯 Deterministic results
  • 💰 Lower token usage (no code snippets)

Cons:

  • 🔄 Needs regeneration when code changes
  • ❌ Can miss patterns (e.g., VisualFlow with export const pattern)
  • 🔧 Requires maintenance of generator scripts

File: src/scripts/bm25-code-search.ts

Algorithm:

  1. Extract search terms from natural language (remove stop words, add variations)
  2. Run grep searches across codebase
  3. Calculate BM25 score per file:
    • TF (term frequency) with saturation: (tf * (k1+1)) / (tf + k1 * (1 - b + b * docLen/avgDocLen))
    • IDF (inverse document frequency): log((N - df + 0.5) / (df + 0.5) + 1)
    • Bonus for model/service/link files
  4. Return top-K results with code snippets

Pros:

  • ✅ Always up-to-date (searches live codebase)
  • ✅ Finds any pattern (not limited to indexed docs)
  • ✅ No maintenance needed
  • ✅ Extracts actual code context for LLM

Cons:

  • 🐢 Slower (runs grep on every query)
  • 📊 Noisier results (migrations, tests, etc.)
  • 💰 Higher token usage (includes code snippets)
  • 🎲 Results vary based on search terms

Combine both approaches for optimal results:

┌─────────────────────────────────────────────────────────┐
│ User Query │
└─────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────┐
│ Step 1: Fast Index Lookup │
│ │
│ - Check module-dictionary for entity │
│ - Check service-relations.json for relations │
│ - Check module-links.json for cross-module links │
│ │
│ IF found → Use structured data (fast path) │
│ IF not found → Continue to Step 2 │
└─────────────────────────────────────────────────────────┘

(not found)


┌─────────────────────────────────────────────────────────┐
│ Step 2: BM25 Dynamic Search │
│ │
│ - Extract search terms │
│ - Run grep on relevant directories │
│ - Apply BM25 ranking │
│ - Extract code snippets │
│ │
│ → Use code context for LLM plan generation │
└─────────────────────────────────────────────────────────┘

Benefits of Hybrid:

  • Fast for common queries (uses index)
  • Fallback for edge cases (uses BM25)
  • Self-healing (BM25 catches what index missed)
  • Lower average latency and token usage

Implementation Notes

When to use Pre-Indexed:

  • Entity detection (Module Dictionary)
  • Service relation queries (design.specifications)
  • Module link queries (design → customer)
  • Filter field validation

When to use BM25:

  • Unknown entities
  • Complex multi-entity queries
  • Code pattern discovery
  • API endpoint lookup

BM25 Tuning:

  • k1 = 1.5 - Term frequency saturation
  • b = 0.75 - Length normalization
  • Model files: 1.5x boost
  • Service files: 1.3x boost
  • Link files: 1.4x boost
  • Route files: 1.2x boost

Performance Comparison

MetricPre-IndexedBM25 DynamicHybrid
Latency~5ms~500ms~5-500ms
Accuracy90%95%98%
Token usageLowHighMedium
MaintenanceWeekly regenerationNoneWeekly

Hybrid Query Resolver (BM25 + LLM)

File: src/scripts/hybrid-query-resolver.ts

The hybrid approach combines BM25 search with LLM analysis to generate:

  • Natural language query patterns
  • Multi-step execution plans
  • Auto path resolution for combo queries

Test Results

Test 1: "designs with specifications"

Target: design | Mode: data | Confidence: 95%

Patterns Generated:
• show me designs with their specifications
• get all designs including specifications
• list designs and their specifications
• fetch designs with specification details

Execution Plan:
await designService.listDesigns({}, { relations: ['specifications'] })

Test 2: "visual flows with executions" (Pre-indexed MISSED this!)

Target: visual flow | Mode: data | Confidence: 92%

Execution Plan:
await visualFlowService.listVisualFlows({}, { relations: ['executions'] })

✅ LLM correctly identified the service relation from code analysis!

Test 3: "production runs for design SKU123" (Combo Query)

Target: production_runs | Mode: data | Confidence: 92%

Execution Plan (Multi-Step Auto Path Resolution):
[1] Resolve design ID from SKU search
const designs = await designService.listDesigns({ sku: 'SKU123' })

[2] Query production runs filtered by design_id
const productionRuns = await productionRunService.listProductionRuns({ design_id: designId })

[3] Optionally enrich with linked tasks
await query.graph({ entity: 'production_runs', fields: ['*', 'tasks.*'] })

✅ LLM correctly identified the combo query pattern with auto path resolution!


Conclusion

For AI V4, use the hybrid approach:

  1. Primary path: Pre-indexed docs for known entities/relations (fast)
  2. Fallback path: BM25 + LLM for edge cases and discovery (accurate)
  3. Learning: Store successful LLM-generated patterns for future fast lookup

Key Scripts:

  • src/scripts/bm25-code-search.ts - BM25 only search
  • src/scripts/hybrid-query-resolver.ts - BM25 + LLM analysis
  • src/scripts/test-query-resolution.ts - Pre-indexed test

This gives the best of both worlds: speed for common queries, accuracy for edge cases.