Transform MCIP from a local experiment into a production-ready system in under an hour. This guide covers configuration, optimization, monitoring, and deployment strategies that work.
Getting MCIP running locally is straightforward—you've probably already done it. Taking it to production is a different adventure entirely.
Think of it like cooking: following a recipe at home is one thing, running a restaurant kitchen is another. You need consistency, reliability, and the ability to handle the dinner rush without breaking a sweat.
This guide walks you through that transformation. We'll cover everything from configuration to deployment, with real examples and honest advice about what actually matters. No fluff, no unnecessary complexity—just what you need to go live with confidence.
MCIP uses a layered configuration approach. Think of it like dressing for unpredictable weather: base layer, insulation, outer shell. Each layer serves a purpose.
Here's your production environment template. Every variable earns its place:
```bash
# .env.production

# Core Application Settings
NODE_ENV=production
PORT=8000
LOG_LEVEL=info

# AI Services - The brain of semantic search
OPENAI_API_KEY=sk-your-production-key
OPENAI_MODEL=text-embedding-3-small
EMBEDDING_DIMENSIONS=1536  # Default for text-embedding-3-small
EMBEDDING_TIMEOUT_MS=3000

# Vector Database - Where meaning lives (Qdrant)
QDRANT_URL=http://qdrant:6333
QDRANT_TIMEOUT_MS=5000
QDRANT_COLLECTION=products

# Session Storage - Redis keeps conversations alive
REDIS_URL=redis://redis-primary:6379
REDIS_PASSWORD=your-secure-password
SESSION_TTL_HOURS=24

# Performance Tuning
SEARCH_TIMEOUT_MS=2500
MAX_CONCURRENT_SEARCHES=10
RESULT_CACHE_TTL_SECONDS=300
EMBEDDING_CACHE_TTL_SECONDS=3600

# Security
CORS_ORIGINS=https://your-app-domain.com
```

**Never commit secrets to version control.** This sounds obvious, but it happens more than you'd think. Use a secrets manager like AWS Secrets Manager or HashiCorp Vault, or at minimum environment-specific `.env` files that are gitignored.
**Use different API keys per environment.** Your production OpenAI key should be separate from your development key. This prevents accidental quota exhaustion and makes billing easier to attribute.

**Set reasonable timeouts.** The defaults assume perfect conditions. In production, networks hiccup, services lag, and users get impatient.
| Service | Development | Production | Why |
|---|---|---|---|
| Embedding API | 5000ms | 3000ms | Fail fast, don't block |
| Vector Search | 10000ms | 5000ms | Users won't wait longer |
| Total Search | 5000ms | 2500ms | Aggregate timeout |
| Redis | 1000ms | 500ms | Should be nearly instant |
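These budgets only help if code enforces them. Here's a minimal sketch of a timeout wrapper that fails fast when a dependency stalls; `withTimeout` is a hypothetical helper, not part of MCIP's API:

```typescript
// Race a dependency call against a deadline so slow services
// fail fast instead of blocking the whole search pipeline.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Clean up the timer whichever promise settles first.
  return Promise.race([promise, deadline]).finally(() => {
    if (timer) clearTimeout(timer);
  });
}

// Usage: cap the embedding call at the production budget from the table.
// const embedding = await withTimeout(generateEmbedding(query), 3000, 'embedding');
```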
Before optimizing anything, understand where your search latency comes from. Here's a typical breakdown:
```
Total Search: 450ms
├── Query Processing:      15ms  (3%)
├── Embedding Generation: 150ms (33%)
├── Vector Search:        250ms (56%)
├── Result Enrichment:     25ms  (6%)
└── Response Formatting:   10ms  (2%)
```
The insight? Embedding generation and vector search dominate. That's where optimization efforts pay off.
Many searches are variations of common queries. "Gaming laptop," "laptop for gaming," "gaming notebook"—they're semantically similar. Caching embeddings for frequent queries dramatically reduces latency for repeat searches.
```typescript
// config/cache.config.ts
export const cacheConfig = {
  embedding: {
    enabled: true,
    ttl: 3600,        // 1 hour
    maxSize: 10000,   // store up to 10k unique query embeddings
    strategy: 'lru',  // Least Recently Used eviction
    keyNormalizer: (query: string) => {
      return query
        .toLowerCase()
        .trim()
        .replace(/\s+/g, ' ')
        .substring(0, 200);
    }
  },
  results: {
    enabled: true,
    ttl: 300,         // 5 minutes
    maxSize: 5000
  }
};
```

Expected impact: 40-60% latency reduction for cached queries, which often represent 30-50% of total traffic.
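To see how `strategy: 'lru'` and the `keyNormalizer` work together, here's a minimal illustrative sketch; the `LRUCache` class is an assumption for demonstration, not MCIP's internal implementation:

```typescript
// Tiny LRU cache: a Map preserves insertion order, so the first key
// is always the least recently used entry.
class LRUCache<V> {
  private map = new Map<string, V>();
  constructor(private maxSize: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      // Re-insert to mark this entry as most recently used.
      this.map.delete(key);
      this.map.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // Evict the least recently used entry (first key in the Map).
      this.map.delete(this.map.keys().next().value!);
    }
  }
}

// Same normalization as keyNormalizer in the config above.
const normalize = (q: string) =>
  q.toLowerCase().trim().replace(/\s+/g, ' ').substring(0, 200);

const embeddings = new LRUCache<number[]>(10000);
// "Gaming  Laptop " and "gaming laptop" normalize to the same key,
// so both hit one cached embedding.
```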
Opening new connections for every request is expensive. Pool them:
```typescript
// config/connections.config.ts
export const connectionConfig = {
  redis: {
    maxConnections: 50,
    minConnections: 10,
    acquireTimeout: 1000,
    idleTimeout: 30000
  },
  http: {
    maxSockets: 100,
    maxFreeSockets: 20,
    keepAlive: true,
    keepAliveMsecs: 30000
  }
};
```

Node.js applications can be memory-hungry. In production, set explicit limits:
```yaml
# docker-compose.production.yml
services:
  mcip:
    image: mcip:latest
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '1'
          memory: 2G
    environment:
      - NODE_OPTIONS=--max-old-space-size=3584
```

Pro tip: Set `max-old-space-size` to roughly 90% of your container memory limit (3584MB is 87.5% of 4G). This gives Node.js room to breathe during garbage collection.
Here's a truth about production systems: if you're not watching, you're guessing. Monitoring transforms "the system feels slow" into "embedding latency increased 40% after the last deployment."
Think of metrics in three categories:
- **Health Metrics** (Is it working?): request success rate, error rate by type, system uptime
- **Performance Metrics** (How well?): response time percentiles (P50, P95, P99), throughput (requests per second), queue depths
- **Business Metrics** (Is it valuable?): search relevance scores, session duration
MCIP exposes metrics at /metrics in Prometheus format:
```typescript
// config/metrics.config.ts
export const metricsProviders = [
  {
    name: 'mcip_search_duration_seconds',
    help: 'Search request duration in seconds',
    labelNames: ['status', 'cache_hit'],
    buckets: [0.1, 0.25, 0.5, 0.75, 1, 2.5, 5]
  },
  {
    name: 'mcip_embedding_duration_seconds',
    help: 'Embedding generation duration',
    labelNames: ['model'],
    buckets: [0.05, 0.1, 0.15, 0.2, 0.3, 0.5]
  },
  {
    name: 'mcip_requests_total',
    help: 'Total number of requests',
    labelNames: ['method', 'status']
  }
];
```

Wire those metrics into Prometheus alerting rules so problems page you before users notice:

```yaml
# alerts/mcip-alerts.yml
groups:
  - name: mcip-critical
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(mcip_requests_total{status="error"}[5m]))
          / sum(rate(mcip_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MCIP error rate above 5%"
      - alert: HighSearchLatency
        expr: |
          histogram_quantile(0.95, rate(mcip_search_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Search P95 latency above 1 second"
```

Perfect for getting started or low-traffic deployments:
```yaml
services:
  mcip:
    image: mcip:${VERSION:-latest}
    ports:
      - "8000:8000"
    environment:
      - NODE_ENV=production
    env_file:
      - .env.production
    depends_on:
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    volumes:
      - redis-data:/data
    healthcheck:
      # Authenticate so PING actually reaches the server
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
    restart: unless-stopped

volumes:
  redis-data:
```

Best for: development teams, proof-of-concept, <100 requests/minute
Multiple MCIP instances behind a load balancer:
```yaml
# docker-compose.balanced.yml
version: '3.8'

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - mcip1
      - mcip2
      - mcip3

  mcip1:
    image: mcip:${VERSION:-latest}
    env_file: .env.production
    depends_on:
      - redis

  mcip2:
    image: mcip:${VERSION:-latest}
    env_file: .env.production
    depends_on:
      - redis

  mcip3:
    image: mcip:${VERSION:-latest}
    env_file: .env.production
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes
    volumes:
      - redis-data:/data

volumes:
  redis-data:
```

Best for: production workloads, 100-1000 requests/minute
For serious scale and operational maturity:
```yaml
# k8s/deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcip
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcip
  template:
    metadata:
      labels:
        app: mcip
    spec:
      containers:
        - name: mcip
          image: mcip:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcip-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcip
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Best for: high-traffic production, 1000+ requests/minute
**Slow searches.** Symptoms: P95 search times above 1 second. Common fixes: increase the embedding cache TTL, add more MCIP instances, review and optimize slow queries, check network latency to external services.

**Memory growth.** Symptoms: container memory steadily increasing, eventual OOM kills. Common fixes: reduce session TTL, lower cache max sizes, implement cache eviction, set explicit memory limits.

**Connection errors.** Symptoms: intermittent failures reaching Redis, Qdrant, or OpenAI. Common fixes: increase connection pool sizes, implement retry with backoff, distribute load across instances.
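For the "retry with backoff" fix, here's a minimal sketch. It's illustrative, not built into MCIP: exponential delays with jitter so a flaky upstream isn't hammered by synchronized retries.

```typescript
// Retry a flaky async call with exponential backoff plus jitter.
// Suitable for transient Redis/Qdrant/OpenAI connection errors;
// don't retry on errors that are clearly permanent (e.g. bad API key).
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  { retries = 3, baseMs = 100 } = {}
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts, surface the error
      // Delays grow 100ms, 200ms, 400ms...; jitter spreads out retries
      // from many instances so they don't stampede a recovering service.
      const delay = baseMs * 2 ** attempt + Math.random() * baseMs;
      await new Promise((res) => setTimeout(res, delay));
    }
  }
}
```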
Congratulations! You've transformed MCIP from a local experiment into a production-ready system.
Remember: production systems are living things. They need attention, care, and occasional adjustment.