MCIP Architecture

Architecture Overview

MCIP uses a three-layer architecture with NestJS at its core, RAG for semantic intelligence, and real-time adapters for platform integration, delivering sub-500ms single-store search across heterogeneous e-commerce platforms.

System Context

MCIP operates as an intelligent middleware layer between AI agents and e-commerce platforms. Unlike traditional API gateways that simply route requests, MCIP adds semantic understanding, session management, and protocol translation.

External Dependencies

AI Services power our semantic understanding:

  • OpenAI API: Generates 512-dimensional embeddings with text-embedding-3-small, truncated from the model's native 1536 dimensions via the dimensions parameter. We chose this configuration for its balance of speed (~150ms) and accuracy in e-commerce contexts.
  • Pinecone: Stores and searches vectors with cosine similarity. Handles 1M+ vectors with sub-300ms query time at p95.

Infrastructure Services ensure reliability:

  • Redis: Manages sessions with 24-hour TTL, maintaining cart state and search history. Chosen for its atomic operations and proven stability at scale.
  • Docker: Containerizes everything for consistent deployment across environments.

Component Architecture

MCIP's internal architecture follows Domain-Driven Design principles with clear separation of concerns:

Layer Responsibilities

Presentation Layer handles protocol translation:

  • MCP Handler: Implements Model Context Protocol with Zod validation
  • REST API: Fallback for non-MCP clients
  • WebSocket: Real-time updates for cart changes
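To make the MCP Handler's validation step concrete, here is a minimal sketch of the search_product input contract. The field names and the hand-rolled check are illustrative only; the shipped handler validates with Zod schemas.

```typescript
// Illustrative input contract for the search_product tool.
// (Field names are assumptions; production validates with Zod.)
interface SearchParams {
    query: string;          // natural-language product query
    maxResults?: number;    // defaults to 10
    storeIds?: string[];    // optionally limit search to specific stores
}

function parseSearchParams(input: unknown): SearchParams {
    const obj = input as Record<string, unknown> | null;
    if (typeof obj?.query !== 'string' || obj.query.trim() === '') {
        throw new Error('query must be a non-empty string');
    }
    const maxResults =
        obj.maxResults === undefined ? 10 : Number(obj.maxResults);
    if (!Number.isInteger(maxResults) || maxResults < 1) {
        throw new Error('maxResults must be a positive integer');
    }
    return {
        query: obj.query,
        maxResults,
        storeIds: obj.storeIds as string[] | undefined,
    };
}
```

Rejecting malformed input at the protocol boundary keeps the layers below free of defensive checks.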

Application Layer orchestrates business logic:

  • Tool Service: Manages the five MCP tools (search, add_to_cart, view_cart, update_cart, clear_cart)
  • Search Orchestrator: Coordinates parallel searches with 2500ms timeout budget
  • Cart Manager: Handles complex cart operations with conflict resolution

Domain Layer contains core business logic:

  • Product Service: Normalizes products from different platforms
  • Session Service: Isolates machine customers with UUID-based sessions
  • Embedding Service: Manages the RAG pipeline for semantic understanding
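The Session Service's isolation model can be sketched in a few lines. The record shape below is an assumption for illustration; the real service persists sessions in Redis with the 24-hour TTL described earlier.

```typescript
import { randomUUID } from 'node:crypto';

// Sketch of UUID-based session isolation. Field names are illustrative;
// production stores these records in Redis with a 24h TTL.
interface Session {
    id: string;
    createdAt: number;
    cart: { productId: string; qty: number }[];
    searchHistory: string[];
}

const sessions = new Map<string, Session>();

function createSession(): Session {
    const session: Session = {
        id: randomUUID(),        // opaque, unguessable session key
        createdAt: Date.now(),
        cart: [],
        searchHistory: [],
    };
    sessions.set(session.id, session);
    return session;
}

function getSession(id: string): Session {
    const session = sessions.get(id);
    if (!session) throw new Error('unknown session'); // no cross-session access
    return session;
}
```

Because the only handle to a session is its UUID, one machine customer cannot enumerate or touch another's cart.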

Infrastructure Layer connects to external systems:

  • Platform Adapters: Translate MCIP requests to platform-specific APIs
  • External Clients: Handle authentication, retries, and rate limiting
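The adapter contract and retry behavior can be sketched as follows. The interface shape, attempt count, and backoff values are assumptions for illustration, not the shipped API.

```typescript
// Illustrative adapter contract: each platform module implements this,
// translating MCIP queries into platform-specific API calls.
interface PlatformAdapter {
    search(query: string): Promise<{ id: string; title: string; price: number }[]>;
}

// Generic retry helper with exponential backoff (100ms, 200ms, 400ms...).
// Attempt count and delays are illustrative defaults.
async function withRetry<T>(
    fn: () => Promise<T>,
    attempts = 3,
    baseDelayMs = 100,
): Promise<T> {
    let lastErr: unknown;
    for (let i = 0; i < attempts; i++) {
        try {
            return await fn();
        } catch (err) {
            lastErr = err;
            await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** i));
        }
    }
    throw lastErr; // all attempts exhausted
}
```

Wrapping every external call this way keeps transient platform hiccups from surfacing as user-visible failures.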

Technology Stack

Core Framework: NestJS

We chose NestJS over Express or Fastify for several critical reasons:

  • Dependency Injection: Clean separation between layers
  • TypeScript First: Type safety across 40,000+ lines of code
  • Modular Architecture: Each adapter is a separate module
  • Built-in Testing: First-class testing utilities that helped us reach 87% coverage
  • Decorator Pattern: Perfect for MCP tool implementation

@MCPTool({
    name: 'search_product',
    description: 'Search products with semantic understanding',
    schema: SearchProductSchema
})
async searchProduct(params: SearchParams): Promise<ProductResult[]> {
    // Implementation leverages DI for all services
}

The RAG Pipeline

Our semantic search pipeline transforms queries through multiple stages:
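The stages can be sketched end to end. Here the embedding call and product fetch are stubbed in-memory so the ranking logic is visible; production uses text-embedding-3-small, Pinecone, and live store adapters.

```typescript
// Stage sketch of the RAG search pipeline (stubbed, in-memory).
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// 1. Embed the query (stub; production calls text-embedding-3-small)
const embed = (q: string): number[] => (q === 'red shoes' ? [1, 0] : [0, 1]);

// 2. Rank stored product vectors by cosine similarity (stub for Pinecone)
const index = [
    { productId: 'sku-1', vector: [0.9, 0.1] },
    { productId: 'sku-2', vector: [0.1, 0.9] },
];
function vectorSearch(queryVec: number[], topK: number): string[] {
    return [...index]
        .sort((x, y) =>
            cosineSimilarity(queryVec, y.vector) -
            cosineSimilarity(queryVec, x.vector))
        .slice(0, topK)
        .map(hit => hit.productId);
}

// 3. The matched IDs are then fetched real-time from the source stores
const hits = vectorSearch(embed('red shoes'), 1); // → ['sku-1']
```

Note that only IDs and vectors live in the index; product data is fetched live at step 3, which is the storage strategy described below.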

Why 512 dimensions?

We evaluated higher-dimensional alternatives, including text-embedding-3-large and the full 1536-dimensional output of text-embedding-3-small, but found:

  • Only 2% accuracy improvement for e-commerce queries
  • 3x slower embedding generation
  • 4x higher memory usage in Pinecone
  • 512-dim delivers 89% relevance at 150ms
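Requesting the reduced dimensionality is a single parameter on the embeddings call. The sketch below only builds the request body (no network call); the endpoint and field names follow OpenAI's public embeddings API.

```typescript
// Build the request body for OpenAI's /v1/embeddings endpoint.
// The `dimensions` parameter truncates text-embedding-3-small's native
// 1536-dim output down to 512.
function buildEmbeddingRequest(input: string) {
    return {
        model: 'text-embedding-3-small',
        input,
        dimensions: 512,
    };
}

// Usage with fetch (not executed here; apiKey is a placeholder):
// await fetch('https://api.openai.com/v1/embeddings', {
//     method: 'POST',
//     headers: {
//         Authorization: `Bearer ${apiKey}`,
//         'Content-Type': 'application/json',
//     },
//     body: JSON.stringify(buildEmbeddingRequest('waterproof hiking boots')),
// });
```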

Data Storage Strategy

No Product Database - This is crucial to understand:

  • We store embeddings (vectors) not products
  • Products are fetched real-time from source systems
  • Cache layer only stores session data and recent searches
  • This ensures data freshness and reduces infrastructure costs

Performance Characteristics

Real-World Metrics

Based on production monitoring across 10,000+ daily searches:

Operation                 P50       P95       P99       Target
Embedding Generation      145ms     189ms     212ms     <200ms
Vector Search             238ms     287ms     342ms     <300ms
Single Store Fetch        180ms     450ms     890ms     <1000ms
Total Search (1 store)    421ms     498ms     587ms     <500ms
Total Search (3 stores)   1,243ms   2,187ms   2,876ms   <3000ms
Cart Operations           12ms      34ms      67ms      <100ms

Optimization Strategies

Parallel Processing is our secret weapon:

// Helper: reject after ms so one slow store can't stall the batch
const timeout = (ms: number) =>
    new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('store timeout')), ms));

// All stores searched simultaneously
const results = await Promise.allSettled(
    stores.map(store =>
        Promise.race([
            store.search(query),
            timeout(1500) // Per-store timeout, within the 2500ms budget
        ])
    )
);
// allSettled: failed or timed-out stores don't block the others

Intelligent Caching reduces redundant work:

  • Embedding cache: Common queries cached for 1 hour
  • Result cache: Identical searches cached for 5 minutes
  • Session cache: Cart state persisted for 24 hours
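The lookup-or-miss-with-expiry logic behind those tiers can be sketched as below. The real caches live in Redis (using its native EXPIRE); this in-memory Map is only an illustration of the behavior.

```typescript
// Minimal TTL cache sketch. Production uses Redis EXPIRE; this Map-based
// version just shows lazy expiry on read.
class TtlCache<V> {
    private store = new Map<string, { value: V; expiresAt: number }>();
    constructor(private ttlMs: number) {}

    get(key: string): V | undefined {
        const entry = this.store.get(key);
        if (!entry) return undefined;
        if (Date.now() >= entry.expiresAt) {  // expired: drop and miss
            this.store.delete(key);
            return undefined;
        }
        return entry.value;
    }

    set(key: string, value: V): void {
        this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    }
}

// One instance per tier, mirroring the TTLs listed above
const embeddingCache = new TtlCache<number[]>(60 * 60 * 1000); // 1 hour
const resultCache = new TtlCache<string[]>(5 * 60 * 1000);     // 5 minutes
```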

Graceful Degradation ensures reliability:

  • If embeddings fail: Fallback to keyword search
  • If store times out: Return partial results
  • If vector DB is slow: Use cached results
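The degradation order above amounts to a fallback chain: try each strategy in turn and return the first success. The function signature below is illustrative, not the shipped API.

```typescript
// Fallback-chain sketch mirroring the degradation order above.
// Signatures are illustrative assumptions.
async function searchWithFallback(
    query: string,
    semanticSearch: (q: string) => Promise<string[]>,
    keywordSearch: (q: string) => Promise<string[]>,
    cachedResults: (q: string) => string[] | undefined,
): Promise<string[]> {
    try {
        return await semanticSearch(query);  // normal RAG path
    } catch {
        // Embeddings or vector DB failed: degrade, don't fail the request
    }
    try {
        return await keywordSearch(query);   // plain keyword fallback
    } catch {
        // Keyword path failed too
    }
    return cachedResults(query) ?? [];       // last resort: cached results
}
```

The request only fails outright when every tier, including the cache, comes up empty.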

Scalability Architecture

Horizontal Scaling Pattern

MCIP scales horizontally with stateless application servers:

Scaling Metrics

Current Capacity (single instance):

  • 1,247 requests/second
  • 100 concurrent sessions
  • 10 stores searched in parallel

Scaled Capacity (3-node cluster):

  • 3,500+ requests/second
  • 1,000+ concurrent sessions
  • Network becomes bottleneck before CPU

Auto-scaling Triggers

We scale based on multiple metrics:

  1. CPU Usage > 70% for 2 minutes → Add instance
  2. Response Time P95 > 600ms → Add instance
  3. Queue Depth > 100 pending requests → Add instance
  4. Time-based → Pre-scale for predictable traffic

Architecture Decisions

Why Monolithic (For Now)

We started with a monolith instead of microservices because:

  1. Faster iteration - Single deployment, unified logging
  2. Lower complexity - No service discovery, no distributed tracing needed yet
  3. Cost effective - One service to run instead of 10
  4. Performance - No network hops between services

We're prepared to extract services when needed:

  • Search Service (when we hit 10K req/s)
  • Embedding Service (if we switch AI providers)
  • Cart Service (for persistent cart requirements)

Why Real-Time Over Cached

Every architecture decision has trade-offs. We chose real-time fetching over maintaining a product cache because:

Advantages:

  • Always accurate inventory
  • No sync delays
  • No storage costs
  • Simpler architecture

Trade-offs:

  • Higher latency per request
  • Dependency on store APIs
  • More complex error handling

For machine customers making purchase decisions, accuracy trumps speed.


Monitoring and Observability

Key Metrics We Track

Business Metrics:

  • Search relevance score (target: >85%)
  • Cart abandonment rate
  • Cross-store search percentage
  • Average products per search

Technical Metrics:

  • API latency (p50, p95, p99)
  • Embedding cache hit rate (target: >40%)
  • Store adapter success rate
  • Redis memory usage

Health Indicators:

GET /health
{
    "status": "healthy",
    "uptime": 425234,
    "memory": { "used": "1.2GB", "limit": "4GB" },
    "redis": "connected",
    "pinecone": "healthy",
    "adapters": {
        "vendure": "online",
        "shopify": "online",
        "woocommerce": "degraded"
    }
}

Security Considerations

While detailed security is covered elsewhere, the architecture implements defense in depth:

  1. API Gateway - Rate limiting, DDoS protection
  2. Application - Input validation, session isolation
  3. Infrastructure - Secrets management, TLS everywhere
  4. Data - No PII storage, encrypted sessions


Summary

MCIP's architecture balances simplicity with sophistication. We've built a system that's easy to deploy (single Docker container) yet powerful enough to handle the complexity of semantic search across heterogeneous e-commerce platforms.

The key insight: by focusing on protocol translation rather than data aggregation, we've created an architecture that scales with the growth of machine customers while maintaining the flexibility to evolve.