MCIP Server

Error Handling

MCIP treats errors as inevitable guests, not unexpected crashes. We classify failures intelligently, degrade gracefully, return partial results when possible, retry smartly, and monitor everything – ensuring users get value even when things go wrong.

Errors Are Not Failures, They're Information

Think about driving with a GPS. Sometimes it loses signal in a tunnel. Sometimes it can't find a specific address. Sometimes the route it suggests is blocked. But a good GPS doesn't just display "ERROR" and shut down. It shows you what it knows, suggests alternatives, and keeps trying to help. That's exactly how MCIP handles errors.

In the world of distributed e-commerce, errors aren't exceptional – they're expected. Platforms go down for maintenance. Networks experience congestion. APIs hit rate limits. Databases time out under load. The question isn't whether errors will occur, but how gracefully we handle them when they do.

MCIP's error handling philosophy is simple: every error is an opportunity to demonstrate resilience. We don't just catch exceptions; we transform them into degraded but useful responses. We don't just retry blindly; we adapt our strategy based on failure patterns. We don't just log errors; we learn from them to prevent future occurrences.


Error Classification: Know Your Enemy

The Error Hierarchy

Not all errors deserve the same response. MCIP classifies errors into distinct categories, each with its own handling strategy:

Transient Errors (Level 1): These are temporary hiccups that usually resolve themselves. Network timeouts, momentary service unavailability, temporary rate limits. Like a friend not answering their phone – they're probably just busy, try again in a moment.

Degraded Service Errors (Level 2): The service works but not optimally. Slow responses, partial data, reduced functionality. Like a restaurant that's out of your first choice – you can still eat, just not what you originally wanted.

Platform Errors (Level 3): Specific to one platform or adapter. Authentication failures, API version mismatches, platform-specific outages. Like one store being closed – annoying, but other stores remain open.

System Errors (Level 4): Affect core MCIP functionality. Database failures, critical service outages, infrastructure problems. Like a power outage – serious, but we have generators (fallbacks) ready.

Fatal Errors (Level 5): Unrecoverable failures requiring intervention. Corrupted data, security breaches, complete system failure. Like a fire alarm – evacuate (safe mode) and call for help.
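The five levels above map naturally onto a severity enum. Here's a minimal sketch – the names and the retry rule are illustrative, not MCIP's actual API:

```python
from enum import IntEnum

class ErrorLevel(IntEnum):
    """Illustrative severity levels mirroring the hierarchy above."""
    TRANSIENT = 1   # temporary hiccups: timeouts, brief rate limits
    DEGRADED = 2    # service works, but slowly or partially
    PLATFORM = 3    # one platform or adapter is failing
    SYSTEM = 4      # core MCIP functionality is affected
    FATAL = 5       # unrecoverable; requires human intervention

def should_retry(level: ErrorLevel) -> bool:
    """Only the lowest two levels are worth automatic retries;
    higher levels need fallbacks or intervention instead."""
    return level <= ErrorLevel.DEGRADED

print(should_retry(ErrorLevel.TRANSIENT))  # True
print(should_retry(ErrorLevel.PLATFORM))   # False
```

Encoding severity as an ordered integer makes policy decisions (retry, alert, escalate) simple comparisons rather than string matching.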

Error Codes That Tell Stories

Our error codes aren't random numbers. They tell you exactly what went wrong:

  • 1000-1999: Session and state errors
  • 2000-2999: Search and query errors
  • 3000-3999: Cart and transaction errors
  • 4000-4999: Platform and adapter errors
  • 5000-5999: Rate limiting and throttling
  • 6000-6999: Authentication and authorization
  • 7000-7999: Data validation and integrity
  • 8000-8999: Infrastructure and system errors
  • 9000-9999: Critical security and fatal errors

Each code includes subcategories. Error 2001 is a search timeout. Error 2002 is invalid search parameters. Error 2003 is no results found. This granularity helps both debugging and automated recovery.
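Because the ranges are thousand-aligned, decoding a code's category is a single integer division. A hypothetical decoder, using the ranges listed above:

```python
# Category names abbreviated from the ranges described above.
ERROR_CATEGORIES = {
    1: "session/state", 2: "search/query", 3: "cart/transaction",
    4: "platform/adapter", 5: "rate limiting", 6: "auth",
    7: "data validation", 8: "infrastructure", 9: "security/fatal",
}

def categorize(code: int) -> str:
    """Map a four-digit error code to its thousand-range category."""
    return ERROR_CATEGORIES.get(code // 1000, "unknown")

print(categorize(2001))  # search/query (a search timeout)
print(categorize(4001))  # platform/adapter
```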


Graceful Degradation: The Art of Failing Well

The Fallback Cascade

When primary systems fail, MCIP doesn't give up. We cascade through increasingly degraded but still useful alternatives:

Primary Path: Full RAG-powered semantic search across all platforms

First Fallback: Keyword search if RAG fails

Second Fallback: Cached results if real-time search fails

Third Fallback: Popular products in the category

Final Fallback: Honest error message with helpful suggestions

It's like planning a vacation. First choice: fly direct. If that fails: connecting flight. If that fails: train. If that fails: drive. If that fails: staycation. You might not get to Paris, but you still get a break.
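The cascade can be sketched as a loop over strategies, returning the first that succeeds. The strategy names here are illustrative:

```python
# Sketch of a fallback cascade: try each strategy in order and return
# the first one that succeeds, remembering what failed along the way.
def cascade(strategies):
    errors = []
    for name, fn in strategies:
        try:
            return name, fn()
        except Exception as exc:  # real code would catch narrower types
            errors.append((name, exc))
    raise RuntimeError(f"all fallbacks exhausted: {errors}")

def failing():
    raise TimeoutError("search timed out")

def cached():
    return ["cached product A", "cached product B"]

used, results = cascade([("semantic", failing),
                         ("keyword", failing),
                         ("cache", cached)])
print(used, results)  # cache ['cached product A', 'cached product B']
```

The final `RuntimeError` corresponds to the "honest error message" step: only after every alternative has failed does the user see an error at all.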

Service Isolation

Each MCIP service is isolated to prevent cascade failures. If the embedding service fails, search continues with keywords. If one adapter fails, others continue. If the ranking service fails, we return unranked results. No single failure can take down the entire system.

Think of it like a house with multiple circuit breakers. If the kitchen electricity fails, the lights in the living room still work. One blown fuse doesn't plunge the entire house into darkness. Each service has its own "circuit breaker" that trips independently.


Partial Results: Something Beats Nothing

The Incomplete Success Philosophy

Perfect is the enemy of good. When we can't deliver 100%, we deliver what we can, with transparency about what's missing. If we search five platforms and two time out, we return results from three with a note about the incomplete coverage.

Users appreciate honesty. "Here are products from 3 out of 5 stores. Two stores didn't respond in time, but you can retry to check them" is infinitely more useful than "Error: Search failed."
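A partial-result response can carry both the data and the honesty in one structure. A minimal sketch, with made-up platform names:

```python
# Sketch: aggregate whatever platforms answered and report the gaps.
def merge_partial(responses: dict) -> dict:
    """responses maps platform -> list of results, or None on timeout."""
    ok = {p: r for p, r in responses.items() if r is not None}
    failed = [p for p, r in responses.items() if r is None]
    return {
        "results": [item for r in ok.values() for item in r],
        "coverage": f"{len(ok)} of {len(responses)} stores",
        "retry_suggested": bool(failed),
        "missing": failed,
    }

out = merge_partial({"shopify": None, "woo": ["mug"], "bigc": ["cup"]})
print(out["coverage"])  # 2 of 3 stores
```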

Progressive Degradation

We degrade progressively, removing non-essential features while preserving core functionality:

  1. Full Feature: Semantic search with all enrichments
  2. Reduced Feature: Basic search without recommendations
  3. Minimal Feature: Simple keyword matching
  4. Emergency Mode: Browse categories only
  5. Maintenance Mode: Clear message with retry option

Each degradation level maintains usability. It's like a car's limp mode – reduced performance but you still get home.
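Selecting the degradation level comes down to picking the richest mode whose required services are currently healthy. A sketch, with illustrative mode and service names:

```python
# Modes ordered from full feature down to maintenance, each with the
# set of services it needs to be healthy.
MODES = [
    ("full", {"embeddings", "ranking", "search"}),
    ("reduced", {"ranking", "search"}),
    ("minimal", {"search"}),
    ("emergency", {"catalog"}),
    ("maintenance", set()),       # always available
]

def pick_mode(healthy: set) -> str:
    """Return the first (richest) mode whose needs are all healthy."""
    for name, needs in MODES:
        if needs <= healthy:
            return name
    return "maintenance"

print(pick_mode({"search", "catalog"}))  # minimal
```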


Retry Strategies: Smart Persistence

Exponential Backoff

When retrying failed requests, we don't hammer the service. We use exponential backoff with jitter, tripling the delay after each attempt:

  • First retry: 100ms wait
  • Second retry: 300ms wait
  • Third retry: 900ms wait
  • Fourth retry: 2,700ms wait

The jitter (random variation) prevents thundering herd problems where all clients retry simultaneously. It's like everyone leaving a concert – staggered exits prevent crushing at the doors.
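The schedule above is a base delay multiplied by 3 per attempt, randomized by jitter. A minimal sketch (the ±20% jitter range is an assumption):

```python
import random

def backoff_delays(base_ms=100, factor=3, retries=4, jitter=0.2):
    """Yield retry delays with exponential growth and +/-20% jitter."""
    for attempt in range(retries):
        nominal = base_ms * factor ** attempt   # 100, 300, 900, 2700
        yield nominal * random.uniform(1 - jitter, 1 + jitter)

for delay in backoff_delays():
    print(f"wait {delay:.0f}ms")
```

With `jitter=0.0` the delays collapse to exactly the schedule listed above; with jitter enabled, no two clients retry in lockstep.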

Circuit Breaker Pattern

Circuit breakers prevent repeated calls to failing services:

Closed State: Normal operation, requests pass through

Open State: Service is failing, requests immediately return cached/fallback responses

Half-Open State: Testing recovery, allowing one request through

When a service fails 5 times in 30 seconds, the circuit opens for 60 seconds. After 60 seconds, it enters half-open state, testing with a single request. Success closes the circuit; failure keeps it open longer.
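The three states and the thresholds from the text (5 failures in 30 seconds, 60-second cool-off) fit in a small state machine. A sketch, with an injectable clock so the transitions are testable:

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open cycle as described above."""

    def __init__(self, max_failures=5, window=30.0, cooloff=60.0,
                 clock=time.monotonic):
        self.max_failures, self.window, self.cooloff = max_failures, window, cooloff
        self.clock = clock
        self.failures = []      # timestamps of recent failures
        self.opened_at = None   # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooloff:
            return "half-open"  # allow one probe request through
        return "open"

    def record_failure(self):
        now = self.clock()
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures or self.state() == "half-open":
            self.opened_at = now    # open, or re-open after a failed probe

    def record_success(self):
        self.failures.clear()
        self.opened_at = None       # close the circuit

t = [0.0]                               # fake clock for the demo
cb = CircuitBreaker(clock=lambda: t[0])
for _ in range(5):
    cb.record_failure()
print(cb.state())   # open
t[0] = 61.0
print(cb.state())   # half-open
```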

Adaptive Timeouts

We adjust timeouts based on historical performance. If a platform usually responds in 200ms but has been slow lately, we extend its timeout to 400ms. If it's been consistently fast, we might reduce the timeout to 150ms. These adaptive timeouts balance patience with performance.
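One common way to implement this (the exact smoothing MCIP uses isn't specified here) is an exponentially weighted moving average of observed latencies, with a safety multiplier and hard bounds:

```python
# Sketch: timeout = bounded multiple of a smoothed latency average.
# alpha, multiplier, and the bounds are illustrative values.
class AdaptiveTimeout:
    def __init__(self, initial_ms=200.0, alpha=0.2, multiplier=2.0,
                 floor_ms=150.0, ceiling_ms=2000.0):
        self.avg = initial_ms
        self.alpha, self.multiplier = alpha, multiplier
        self.floor, self.ceiling = floor_ms, ceiling_ms

    def observe(self, latency_ms: float):
        """Fold one observed latency into the moving average."""
        self.avg = (1 - self.alpha) * self.avg + self.alpha * latency_ms

    def timeout_ms(self) -> float:
        return min(self.ceiling, max(self.floor, self.avg * self.multiplier))

at = AdaptiveTimeout()
for ms in (350, 380, 400):   # the platform has been slow lately
    at.observe(ms)
print(round(at.timeout_ms()))  # 576 - timeout stretched from the 400ms default
```

The floor and ceiling keep the balance the text describes: patience for a struggling platform, but never an unbounded wait.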


Logging and Monitoring: Learning from Failure

Structured Error Logging

Every error generates a structured log entry:

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "error_code": 4001,
  "severity": "warning",
  "service": "shopify-adapter",
  "message": "API rate limit exceeded",
  "context": {
    "query": "laptop",
    "platform": "shopify",
    "retry_count": 2,
    "user_session": "uuid-xxx"
  },
  "recovery": "Using cached results",
  "impact": "degraded"
}

This structure enables automated analysis. We can track error trends, identify patterns, and trigger alerts based on specific conditions.
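Emitting an entry in that shape is straightforward. A sketch using the standard library (field names follow the example above; the transport is stubbed with `print`):

```python
import json
import datetime

def log_error(code, severity, service, message, context, recovery, impact):
    """Emit one structured entry shaped like the example above."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                     .isoformat(timespec="milliseconds")
                     .replace("+00:00", "Z"),
        "error_code": code,
        "severity": severity,
        "service": service,
        "message": message,
        "context": context,
        "recovery": recovery,
        "impact": impact,
    }
    print(json.dumps(entry))   # real code would ship this to a log pipeline
    return entry

entry = log_error(4001, "warning", "shopify-adapter",
                  "API rate limit exceeded",
                  {"query": "laptop", "retry_count": 2},
                  "Using cached results", "degraded")
```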

Monitoring Dashboards

Our monitoring tracks error metrics in real-time:

  • Error Rate: Errors per minute/hour/day
  • Error Distribution: Which errors occur most frequently
  • Recovery Success: How often fallbacks work
  • Impact Analysis: How errors affect user experience
  • Platform Health: Which platforms are struggling

These dashboards aren't just for debugging – they're for learning. Frequent timeout errors might indicate we need to adjust our time budgets. Regular authentication failures might suggest API changes we need to accommodate.

Alerting Intelligence

Not every error triggers an alert. We use intelligent thresholds:

  • Single errors: Log but don't alert
  • Error clusters: Alert if 10+ similar errors in 5 minutes
  • Critical errors: Immediate alerts
  • Degraded service: Alert if lasting >10 minutes
  • Recovery notifications: Alert when services recover

This prevents alert fatigue while ensuring critical issues get immediate attention.
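The cluster rule (10+ similar errors in 5 minutes) is a sliding-window count per error code. A minimal sketch with an injectable clock:

```python
import time
from collections import deque

class ClusterAlert:
    """Alert only when similar errors cluster inside a time window."""

    def __init__(self, threshold=10, window_s=300, clock=time.monotonic):
        self.threshold, self.window = threshold, window_s
        self.clock = clock
        self.seen = {}  # error_code -> deque of timestamps

    def record(self, error_code) -> bool:
        """Return True when this error crosses the alert threshold."""
        now = self.clock()
        q = self.seen.setdefault(error_code, deque())
        q.append(now)
        while q and now - q[0] > self.window:
            q.popleft()     # drop events outside the 5-minute window
        return len(q) >= self.threshold

t = [0.0]
alerts = ClusterAlert(clock=lambda: t[0])
fired = [alerts.record(2001) for _ in range(10)]
print(fired[-1])  # True: tenth similar error within the window
```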


Error Recovery in Action

Here's how MCIP handles a real scenario where Shopify's API goes down during a search:

  1. Detection: Shopify adapter times out after 1.5 seconds
  2. Classification: Marked as Level 3 Platform Error
  3. Circuit Break: After 3 failures, circuit breaker opens
  4. Notification: User sees "Searching 4 out of 5 stores"
  5. Partial Results: Results from other platforms returned
  6. Cache Check: Recent Shopify results added if available
  7. Logging: Error logged with full context
  8. Monitoring: Dashboard shows Shopify degradation
  9. Recovery Test: Circuit breaker tests after 60 seconds
  10. Resolution: Normal service resumes when Shopify recovers

The user experience? They get results in 2 seconds instead of waiting for a timeout, see products from available platforms, and can retry later for complete results. Not perfect, but perfectly usable.