MCIP treats errors as inevitable guests, not unexpected crashes. We classify failures intelligently, degrade gracefully, return partial results when possible, retry smartly, and monitor everything – ensuring users get value even when things go wrong.
Think about driving with a GPS. Sometimes it loses signal in a tunnel. Sometimes it can't find a specific address. Sometimes the route it suggests is blocked. But a good GPS doesn't just display "ERROR" and shut down. It shows you what it knows, suggests alternatives, and keeps trying to help. That's exactly how MCIP handles errors.
In the world of distributed e-commerce, errors aren't exceptional – they're expected. Platforms go down for maintenance. Networks experience congestion. APIs hit rate limits. Databases time out under load. The question isn't whether errors will occur, but how gracefully we handle them when they do.
MCIP's error handling philosophy is simple: every error is an opportunity to demonstrate resilience. We don't just catch exceptions; we transform them into degraded but useful responses. We don't just retry blindly; we adapt our strategy based on failure patterns. We don't just log errors; we learn from them to prevent future occurrences.
Not all errors deserve the same response. MCIP classifies errors into distinct categories, each with its own handling strategy:
Transient Errors (Level 1): These are temporary hiccups that usually resolve themselves. Network timeouts, momentary service unavailability, temporary rate limits. Like a friend not answering their phone – they're probably just busy, try again in a moment.
Degraded Service Errors (Level 2): The service works but not optimally. Slow responses, partial data, reduced functionality. Like a restaurant that's out of your first choice – you can still eat, just not what you originally wanted.
Platform Errors (Level 3): Specific to one platform or adapter. Authentication failures, API version mismatches, platform-specific outages. Like one store being closed – annoying, but other stores remain open.
System Errors (Level 4): Affect core MCIP functionality. Database failures, critical service outages, infrastructure problems. Like a power outage – serious, but we have generators (fallbacks) ready.
Fatal Errors (Level 5): Unrecoverable failures requiring intervention. Corrupted data, security breaches, complete system failure. Like a fire alarm – evacuate (safe mode) and call for help.
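The five levels above can be sketched as a small taxonomy that maps each level to its default handling strategy. This is an illustrative sketch, not MCIP's actual implementation; the enum and strategy names are assumptions chosen to match the descriptions above.

```python
from enum import IntEnum

class ErrorLevel(IntEnum):
    """MCIP's five severity levels, from self-resolving to unrecoverable."""
    TRANSIENT = 1  # network timeouts, momentary unavailability
    DEGRADED = 2   # slow responses, partial data
    PLATFORM = 3   # one adapter or platform failing
    SYSTEM = 4     # core infrastructure problems
    FATAL = 5      # corrupted data, security breach

def default_strategy(level: ErrorLevel) -> str:
    """Each level gets its own handling strategy (hypothetical names)."""
    return {
        ErrorLevel.TRANSIENT: "retry_with_backoff",
        ErrorLevel.DEGRADED: "serve_partial",
        ErrorLevel.PLATFORM: "isolate_and_continue",
        ErrorLevel.SYSTEM: "activate_fallback",
        ErrorLevel.FATAL: "safe_mode",
    }[level]
```

The point of an ordered enum is that severity comparisons come for free: anything at or above `SYSTEM` can trigger fallback infrastructure without consulting a lookup table.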
Our error codes aren't random numbers. They tell you exactly what went wrong.
Each code includes subcategories. Error 2001 is a search timeout. Error 2002 is invalid search parameters. Error 2003 is no results found. This granularity helps both debugging and automated recovery.
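The subcategory granularity pays off in automated recovery: a timeout is worth retrying, while bad parameters or an empty result set are not. A minimal registry sketch, using only the three codes named above (the retry policies attached to them are assumptions):

```python
# Hypothetical registry illustrating the subcategory granularity.
ERROR_CODES = {
    2001: ("search_timeout", "retryable"),
    2002: ("invalid_search_parameters", "not_retryable"),
    2003: ("no_results_found", "not_retryable"),
}

def is_retryable(code: int) -> bool:
    """Timeouts can be retried automatically; bad input cannot."""
    _, policy = ERROR_CODES.get(code, ("unknown", "not_retryable"))
    return policy == "retryable"
```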
When primary systems fail, MCIP doesn't give up. We cascade through increasingly degraded but still useful alternatives:
Primary Path: Full RAG-powered semantic search across all platforms
First Fallback: Keyword search if RAG fails
Second Fallback: Cached results if real-time search fails
Third Fallback: Popular products in the category
Final Fallback: Honest error message with helpful suggestions
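The cascade above can be expressed as an ordered list of attempts, each one tried only if everything before it failed or came back empty. This is a sketch under assumed interfaces (`rag_search`, `keyword_search`, a cache with `get`, and a `popular` source are hypothetical callables, not MCIP's real API):

```python
def search_with_fallbacks(query, rag_search, keyword_search, cache, popular):
    """Walk the fallback chain; return the first usable result and
    note which (possibly degraded) path produced it."""
    attempts = [
        ("rag", lambda: rag_search(query)),
        ("keyword", lambda: keyword_search(query)),
        ("cache", lambda: cache.get(query)),
        ("popular", lambda: popular(query)),
    ]
    for source, attempt in attempts:
        try:
            results = attempt()
            if results:
                return {"source": source, "results": results}
        except Exception:
            continue  # fall through to the next, more degraded path
    # Final fallback: an honest error with a helpful suggestion.
    return {"source": "none", "results": [],
            "message": "Search is unavailable right now; please retry shortly."}
```

Tagging the response with its `source` lets the UI be transparent about degraded results, which matters for the partial-results philosophy discussed below.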
It's like planning a vacation. First choice: fly direct. If that fails: connecting flight. If that fails: train. If that fails: drive. If that fails: staycation. You might not get to Paris, but you still get a break.
Each MCIP service is isolated to prevent cascade failures. If the embedding service fails, search continues with keywords. If one adapter fails, others continue. If the ranking service fails, we return unranked results. No single failure can take down the entire system.
Think of it like a house with multiple circuit breakers. If the kitchen electricity fails, the lights in the living room still work. One blown fuse doesn't plunge the entire house into darkness. Each service has its own "circuit breaker" that trips independently.
Perfect is the enemy of good. When we can't deliver 100%, we deliver what we can with transparency about what's missing. If we search five platforms and two time out, we return results from three with a note about the incomplete coverage.
Users appreciate honesty. "Here are products from 3 out of 5 stores. Two stores didn't respond in time, but you can retry to check them" is infinitely more useful than "Error: Search failed."
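Assembling that kind of honest partial response is mostly bookkeeping: merge what arrived, name what didn't, and flag the response as retryable. A sketch, with an assumed input shape (platform name mapped to a result list, or `None` on timeout):

```python
def assemble_partial_response(platform_results):
    """platform_results: dict of platform -> product list, or None if
    that platform timed out or errored."""
    responded = {p: r for p, r in platform_results.items() if r is not None}
    failed = [p for p, r in platform_results.items() if r is None]
    products = [item for r in responded.values() for item in r]
    note = None
    if failed:
        note = (f"Results from {len(responded)} of {len(platform_results)} stores. "
                f"{', '.join(failed)} didn't respond in time; retry to check them.")
    return {"products": products, "coverage_note": note, "retryable": bool(failed)}
```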
We degrade progressively, removing non-essential features while preserving core functionality.
Each degradation level maintains usability. It's like a car's limp mode – reduced performance but you still get home.
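One way to sketch a progressive-degradation ladder: each rung drops a non-essential feature while core search survives to the last. The feature names and the load-based selection rule here are hypothetical, not MCIP's actual levels:

```python
# Hypothetical degradation ladder: drop extras first, keep core search last.
DEGRADATION_LEVELS = [
    {"level": 0, "features": {"semantic_search", "personalization", "images", "ranking"}},
    {"level": 1, "features": {"semantic_search", "images", "ranking"}},  # drop personalization
    {"level": 2, "features": {"semantic_search", "ranking"}},            # drop images
    {"level": 3, "features": {"keyword_search"}},                        # core only ("limp mode")
]

def features_for(load_factor: float) -> set:
    """Pick a level from current load (0.0 = idle, 1.0 = saturated)."""
    index = min(int(load_factor * len(DEGRADATION_LEVELS)),
                len(DEGRADATION_LEVELS) - 1)
    return DEGRADATION_LEVELS[index]["features"]
```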
When retrying failed requests, we don't hammer the service. We use exponential backoff with jitter.
The jitter (random variation) prevents thundering herd problems where all clients retry simultaneously. It's like everyone leaving a concert – staggered exits prevent crushing at the doors.
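Exponential backoff with full jitter fits in a few lines: double the delay ceiling on each attempt, then sleep a random fraction of it so retries spread out. A minimal sketch (the parameter values are illustrative defaults, not MCIP's configuration):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, cap=8.0):
    """Retry `operation`, doubling the delay ceiling each attempt and
    sleeping a random amount up to it (full jitter)."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # jitter staggers clients
```

Full jitter (uniform over `[0, delay]`) is the piece that prevents the thundering herd: even if thousands of clients fail at the same instant, their retries land at different times.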
Circuit breakers prevent repeated calls to failing services:
Closed State: Normal operation, requests pass through
Open State: Service is failing, requests immediately return cached/fallback responses
Half-Open State: Testing recovery, allowing one request through
When a service fails 5 times in 30 seconds, the circuit opens for 60 seconds. After 60 seconds, it enters half-open state, testing with a single request. Success closes the circuit; failure keeps it open longer.
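The three states and the 5-failures-in-30-seconds / 60-second-cooldown policy translate into a compact state machine. A sketch, using an injectable clock for testability (class and method names are assumptions):

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` failures within `window` seconds;
    after `cooldown` seconds, allows one trial request (half-open)."""

    def __init__(self, failure_threshold=5, window=30.0, cooldown=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window = window
        self.cooldown = cooldown
        self.clock = clock
        self.failures = []     # timestamps of recent failures
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                   # closed: pass through
        if self.clock() - self.opened_at >= self.cooldown:
            return True                                   # half-open: one trial
        return False                                      # open: use fallback

    def record_success(self):
        self.failures.clear()
        self.opened_at = None  # trial succeeded: close the circuit

    def record_failure(self):
        now = self.clock()
        if self.opened_at is not None:
            self.opened_at = now  # trial failed: keep it open longer
            return
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now  # trip: open the circuit
```

While the circuit is open, `allow_request()` returning `False` is the signal to serve cached or fallback responses immediately instead of waiting on a known-bad service.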
We adjust timeouts based on historical performance. If a platform usually responds in 200ms but has been slow lately, we extend its timeout to 400ms. If it's been consistently fast, we might reduce the timeout to 150ms. These adaptive timeouts balance patience with performance.
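One simple way to implement this: keep an exponential moving average of observed response times and size the timeout as a multiple of it, clamped to the 150–400ms band mentioned above. The smoothing factor and multiplier here are assumptions, not MCIP's tuned values:

```python
class AdaptiveTimeout:
    """Track an exponential moving average (EMA) of response times and
    size the timeout as a multiple of it, within fixed bounds."""

    def __init__(self, initial_ms=200.0, multiplier=1.5,
                 floor_ms=150.0, ceiling_ms=400.0, alpha=0.2):
        self.avg_ms = initial_ms
        self.multiplier = multiplier
        self.floor_ms = floor_ms
        self.ceiling_ms = ceiling_ms
        self.alpha = alpha  # EMA weight for the newest observation

    def observe(self, response_ms: float):
        """Fold one observed response time into the running average."""
        self.avg_ms = (1 - self.alpha) * self.avg_ms + self.alpha * response_ms

    def timeout_ms(self) -> float:
        """Patience scaled to recent behavior, clamped to sane bounds."""
        return max(self.floor_ms,
                   min(self.ceiling_ms, self.avg_ms * self.multiplier))
```

The clamp is what keeps the policy honest: a platform that's been slow lately gets more patience, but never unbounded patience, and a consistently fast one never gets a timeout so tight that normal jitter trips it.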
Every error generates a structured log entry:
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "error_code": 4001,
  "severity": "warning",
  "service": "shopify-adapter",
  "message": "API rate limit exceeded",
  "context": {
    "query": "laptop",
    "platform": "shopify",
    "retry_count": 2,
    "user_session": "uuid-xxx"
  },
  "recovery": "Using cached results",
  "impact": "degraded"
}

This structure enables automated analysis. We can track error trends, identify patterns, and trigger alerts based on specific conditions.
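A sketch of the emitting and analyzing sides of such entries: one helper producing a structured record in the shape shown above, and one counting occurrences per (service, error_code) to surface trends. Function names are hypothetical:

```python
import json
from collections import Counter
from datetime import datetime, timezone

def log_error(error_code, severity, service, message, context, recovery, impact):
    """Emit one structured entry in the shape shown above."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "error_code": error_code,
        "severity": severity,
        "service": service,
        "message": message,
        "context": context,
        "recovery": recovery,
        "impact": impact,
    }
    print(json.dumps(entry))  # one JSON object per line: trivially machine-parsable
    return entry

def error_trend(entries):
    """Count occurrences per (service, error_code) to spot recurring failures."""
    return Counter((e["service"], e["error_code"]) for e in entries)
```

Because every entry carries the same keys, downstream tooling can aggregate, filter, and alert without any per-message parsing logic.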
Our monitoring tracks error metrics in real time.
These dashboards aren't just for debugging – they're for learning. Frequent timeout errors might indicate we need to adjust our time budgets. Regular authentication failures might suggest API changes we need to accommodate.
Not every error triggers an alert. We use intelligent thresholds.
This prevents alert fatigue while ensuring critical issues get immediate attention.
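A hypothetical threshold policy illustrating the idea: alert sensitivity scales with the five severity levels, so transient noise shows up only on dashboards while fatal errors always page someone. The counts, windows, and channel names are invented for illustration:

```python
# Hypothetical alerting policy keyed by error level (1 = transient, 5 = fatal).
ALERT_THRESHOLDS = {
    1: {"count": 100, "window_s": 300, "channel": "dashboard"},  # trend only
    2: {"count": 20,  "window_s": 300, "channel": "chat"},
    3: {"count": 5,   "window_s": 60,  "channel": "chat"},
    4: {"count": 1,   "window_s": 60,  "channel": "pager"},
    5: {"count": 1,   "window_s": 1,   "channel": "pager"},      # immediate
}

def should_alert(level: int, recent_count: int) -> bool:
    """Alert only when recent errors at this level cross the threshold."""
    return recent_count >= ALERT_THRESHOLDS[level]["count"]
```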
Here's how MCIP handles a real scenario where Shopify's API goes down during a search.
The user experience? They get results in 2 seconds instead of waiting for a timeout, see products from available platforms, and can retry later for complete results. Not perfect, but perfectly usable.