Overview
This guide provides solutions for common issues encountered when deploying and operating CrewAI Platform on Kubernetes. For additional support, generate a support bundle and contact your CrewAI representative.
Diagnostic Commands
Before troubleshooting specific issues, gather diagnostic information:
# Check all CrewAI Platform resources
kubectl get all -l app.kubernetes.io/name=crewai-platform
# View recent events
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Check pod status
kubectl get pods -o wide
# View logs for specific component
kubectl logs -l app.kubernetes.io/component=web --tail=100
kubectl logs -l app.kubernetes.io/component=worker --tail=100
# Check resource usage
kubectl top nodes
kubectl top pods
Common Issues
Pod CrashLoopBackOff
Symptoms:
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# crewai-web-xxx 0/1 CrashLoopBackOff 5 3m
Common Causes:
- Missing required secrets
- Database connection failure
- Resource limits too restrictive
- Invalid configuration
Diagnosis:
kubectl logs POD_NAME
kubectl describe pod POD_NAME
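While a container is crash-looping, the current instance may not have written any logs yet, so the output of the previous (crashed) instance and its recorded termination reason are often more useful. A minimal sketch, assuming a single-container pod (container index 0); `crash_report` is a hypothetical helper name, not something shipped with the platform:

```shell
# crash_report is a hypothetical helper, not part of CrewAI Platform.
# Usage: crash_report POD_NAME
crash_report() {
  # Why the previous container instance terminated (e.g. Error, OOMKilled)
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
  # Exit code of the previous instance
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
  # Logs from the crashed instance rather than the freshly restarted one
  kubectl logs "$1" --previous --tail=50
}
```

A reason of OOMKilled points at memory limits; a plain Error with a stack trace in the previous logs usually points at secrets, configuration, or the database.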
Pod Startup Failures
Diagnosis:
- Check pod logs for errors
- Verify resource limits are not too restrictive
- Check secret availability
- Verify image pull secrets are configured
# Check pod status and logs
kubectl get pods
kubectl describe pod POD_NAME
kubectl logs POD_NAME
Database Connection Issues
Symptoms: Logs show "could not connect to server: Connection refused"
Diagnosis:
If the application cannot connect to the database:
- Verify DB_HOST is set correctly for external databases
- Check database credentials in secrets
- Ensure database allows connections from Kubernetes cluster
- Verify database name and port configuration
- Check security groups/firewall rules
# Check database connectivity from a pod
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h DB_HOST -U DB_USER -d POSTGRES_DB
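When the `psql` result is ambiguous, `pg_isready` (included in the same `postgres` image) can help separate "server unreachable" from "credentials rejected", because it only checks whether the server accepts connections. A sketch, assuming port 5432; `explain_pg_isready` is a hypothetical helper that maps the exit codes documented for `pg_isready` to a next step:

```shell
# Run pg_isready from a throwaway pod. DB_HOST is a placeholder as above;
# port 5432 is an assumption, use your configured port.
#   kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
#     pg_isready -h DB_HOST -p 5432

# explain_pg_isready is a hypothetical helper.
# Usage: explain_pg_isready EXIT_CODE
explain_pg_isready() {
  case "$1" in
    0) echo "server is accepting connections" ;;
    1) echo "server is rejecting connections (for example, still starting up)" ;;
    2) echo "no response: check DB_HOST, port, and firewall rules" ;;
    3) echo "no attempt made: invalid parameters" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
```

If `pg_isready` reports that the server accepts connections but `psql` still fails, the problem is more likely the credentials or database name than the network path.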
Storage/S3 Connection Issues
Diagnosis:
- Verify AWS credentials are correct
- Check bucket name and region
- Ensure IAM permissions allow bucket access
- For MinIO, verify endpoint URL is accessible
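The checks above can be run by hand with the AWS CLI, provided it is configured with the same credentials the platform uses. This is a sketch under that assumption; `check_bucket` is a hypothetical helper, and the bucket name and endpoint URL are placeholders:

```shell
# check_bucket is a hypothetical helper, not part of CrewAI Platform.
# Usage: check_bucket BUCKET [ENDPOINT_URL]
check_bucket() {
  if [ -n "${2:-}" ]; then
    # MinIO or another S3-compatible store: point the CLI at its endpoint
    aws s3api head-bucket --bucket "$1" --endpoint-url "$2"
  else
    # Show which identity the credentials resolve to
    aws sts get-caller-identity
    # Succeeds only if the bucket exists and the identity may access it
    aws s3api head-bucket --bucket "$1"
  fi
}
```

A 403 from `head-bucket` suggests an IAM permission problem, while a 404 suggests a wrong bucket name or region.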
Image Pull Errors
Symptoms: ErrImagePull, ImagePullBackOff
Solutions:
- Verify image name and tag exist
- Check pull secret configuration
- Verify registry credentials
- Check network connectivity to registry
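The first three checks can be read directly from the pod spec and the pull secret. A minimal sketch; `show_pull_config` is a hypothetical helper, and the secret name you pass is whatever your values file configures:

```shell
# show_pull_config is a hypothetical helper, not part of CrewAI Platform.
# Usage: show_pull_config POD_NAME SECRET_NAME
show_pull_config() {
  # Image references the pod is trying to pull
  kubectl get pod "$1" -o jsonpath='{.spec.containers[*].image}{"\n"}'
  # Pull secrets attached to the pod spec (empty output means none)
  kubectl get pod "$1" -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
  # A registry secret should have type kubernetes.io/dockerconfigjson
  kubectl get secret "$2" -o jsonpath='{.type}{"\n"}'
}
```

The Events section of `kubectl describe pod POD_NAME` shows the registry's own error text, which typically distinguishes a missing tag from an authentication failure.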
Ingress Not Accessible
Symptoms: Cannot reach application via ingress hostname
Diagnosis:
# Check ingress status
kubectl get ingress
kubectl describe ingress crewai-ingress
Solutions:
- Verify ingress controller is installed
- Check ingress className matches your controller
- Verify DNS points to ingress load balancer
- Check TLS certificate configuration
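These checks can be sketched as a short sequence of commands. `check_ingress` is a hypothetical helper; the ingress name and hostname are placeholders for your own values:

```shell
# check_ingress is a hypothetical helper, not part of CrewAI Platform.
# Usage: check_ingress INGRESS_NAME HOSTNAME
check_ingress() {
  # Ingress classes offered by the installed controllers
  kubectl get ingressclass
  # Class requested by the ingress; it should match one listed above
  kubectl get ingress "$1" -o jsonpath='{.spec.ingressClassName}{"\n"}'
  # Address assigned by the controller (empty if no controller picked it up)
  kubectl get ingress "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[*].hostname}{.status.loadBalancer.ingress[*].ip}{"\n"}'
  # Response headers through the ingress; DNS and TLS errors surface here
  curl -sSI "https://$2"
}
```

For example, `check_ingress crewai-ingress crewai.example.com`, where the hostname is illustrative.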
Out of Memory (OOMKilled)
Symptoms: Pods restarting with OOMKilled
Solutions:
- Increase memory limits
- Tune WEB_CONCURRENCY and RAILS_MAX_THREADS
- Monitor actual memory usage
- Check for memory leaks
# Check memory usage patterns
kubectl top pods --sort-by=memory
# Increase memory limits in values.yaml
web:
  resources:
    limits:
      memory: "16Gi" # Increase from 12Gi default
    requests:
      memory: "8Gi"
# Reduce concurrency to lower memory usage
envVars:
  WEB_CONCURRENCY: 2 # Reduce worker processes
  RAILS_MAX_THREADS: 3 # Reduce threads per process
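Before raising limits, it can be worth confirming which pods were actually killed for exceeding memory. A minimal sketch, run in the namespace where the platform is installed; `list_oomkilled` is a hypothetical helper:

```shell
# list_oomkilled is a hypothetical helper, not part of CrewAI Platform.
# Prints "POD_NAME<TAB>REASON" for pods whose previous container instance
# was terminated with reason OOMKilled.
list_oomkilled() {
  kubectl get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | grep OOMKilled
}
```

No output means that no pod in the namespace has an OOMKilled termination recorded for its last restart.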
BuildKit Build Failures
Symptoms: Crew builds fail, BuildKit errors in logs
Common Causes:
- Registry authentication issues
- Insufficient BuildKit resources
- Network connectivity problems
Solutions:
# Check BuildKit pod status
kubectl get pods -l app.kubernetes.io/component=buildkit
# View BuildKit logs
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100
# Verify registry authentication
kubectl get secret docker-registry -o yaml
# Increase BuildKit resources if needed
buildkit:
  resources:
    limits:
      cpu: "8"
      memory: "16Gi"
# Configure insecure registry if using internal registry
buildkit:
  registries:
    - hostname: "registry.internal.company.com"
      insecure: true
      http: true
Persistent Volume Issues
Symptoms: Pods stuck in Pending state, PVC not binding
Solutions:
# Check PVC status
kubectl get pvc
# Describe PVC for events
kubectl describe pvc crewai-postgres-data-crewai-postgres-0
# Check StorageClass availability
kubectl get storageclass
# Check if default StorageClass is set
kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}'
# Set default StorageClass if needed
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
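To connect a Pending pod to the claim that is blocking it, the scheduler events and the phase of each PVC can be listed together. A sketch only; `pending_report` is a hypothetical helper:

```shell
# pending_report is a hypothetical helper, not part of CrewAI Platform.
pending_report() {
  # Pods the scheduler has not been able to place
  kubectl get pods --field-selector=status.phase=Pending
  # Scheduler events, which name unbound claims when they block placement
  kubectl get events --field-selector reason=FailedScheduling \
    --sort-by='.lastTimestamp'
  # Name, phase, and requested StorageClass of every claim
  kubectl get pvc \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.storageClassName}{"\n"}{end}'
}
```

A claim that stays in the Pending phase with an empty StorageClass column usually means no default StorageClass is set.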
Authentication Provider Issues
Symptoms: Unable to log in, OAuth errors
Solutions:
Entra ID Issues
# Verify secrets are set
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_CLIENT_ID}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_TENANT_ID}' | base64 -d
# Check redirect URI configuration
# Must match: https://YOUR_HOST/auth/entra_id/callback
# Verify APPLICATION_HOST matches your domain
kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST
Okta Issues
# Verify Okta configuration
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_SITE}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_CLIENT_ID}' | base64 -d
# Check redirect URI in Okta
# Must match: https://YOUR_HOST/auth/okta/callback
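Redirect URI mismatches are easier to spot when the expected value is built from the host the application actually sees. A small sketch, assuming APPLICATION_HOST holds a bare hostname with no scheme or trailing slash; `expected_redirect_uri` is a hypothetical helper:

```shell
# expected_redirect_uri is a hypothetical helper, not part of CrewAI Platform.
# Usage: expected_redirect_uri HOST PROVIDER   (PROVIDER: entra_id or okta)
# Assumes HOST is a bare hostname with no scheme or trailing slash.
expected_redirect_uri() {
  echo "https://$1/auth/$2/callback"
}

# Example, using the value the web deployment actually sees:
#   expected_redirect_uri \
#     "$(kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST)" okta
```

Compare the printed URI character for character with the one registered in the identity provider.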
Secret Management Issues
Symptoms: Pods fail to start, missing secret errors
Solutions:
# Check if secrets exist
kubectl get secret crewai-secrets
# List secret keys and sizes (describe does not print the values)
kubectl describe secret crewai-secrets
# For external secret store issues
kubectl get secretstore
kubectl get externalsecret
kubectl describe externalsecret crewai-external-secret
# Check External Secrets Operator logs
kubectl logs -n external-secrets-operator deployment/external-secrets
Configuration Warnings
RAILS_MASTER_KEY Warning
Warning Message:
⚠️ WARNING: RAILS_MASTER_KEY detected in your configuration.
This key is automatically managed by the chart and should not be set manually.
Please remove RAILS_MASTER_KEY from your values.
Cause: You have manually configured RAILS_MASTER_KEY in either envVars or secrets in your values file.
Solution:
The chart automatically manages RAILS_MASTER_KEY and does not require manual configuration. Remove this setting from your values file:
# Remove these lines from your values.yaml:
envVars:
  RAILS_MASTER_KEY: "..." # Remove this
secrets:
  RAILS_MASTER_KEY: "..." # Remove this
Then upgrade your deployment:
helm upgrade crewai-platform \
  oci://registry.crewai.com/crewai/stable/crewai-platform \
  --values my-values.yaml
Why This Matters: The chart uses a different Rails configuration approach that doesn’t require RAILS_MASTER_KEY. Setting it manually can cause configuration conflicts.
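To confirm that a values file no longer sets the key, a simple search is enough. A minimal sketch; `find_master_key` is a hypothetical helper, and `my-values.yaml` is the file name used in the upgrade command above:

```shell
# find_master_key is a hypothetical helper, not part of CrewAI Platform.
# Usage: find_master_key my-values.yaml
# Prints matching lines with their line numbers. Returns 0 if the key is
# still set and 1 if the file is clean. Commented-out lines are ignored.
find_master_key() {
  grep -n '^[[:space:]]*RAILS_MASTER_KEY[[:space:]]*:' "$1"
}
```

If several values files are passed to `helm upgrade`, each one needs to be checked.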
Performance Issues
Symptoms: Slow response times, high latency
Diagnostic Steps:
# Check resource utilization
kubectl top nodes
kubectl top pods
# Review slow query logs (if configured)
kubectl logs -l app.kubernetes.io/component=web | grep "Slow query"
Solutions:
- Scale web replicas:
web:
  replicaCount: 4 # Increase from 2
- Increase resources:
web:
  resources:
    limits:
      cpu: "8"
      memory: "16Gi"
    requests:
      cpu: "2000m"
      memory: "8Gi"
- Tune concurrency:
envVars:
  WEB_CONCURRENCY: 4
  RAILS_MAX_THREADS: 10
- Add database read replicas (configure in external database)
Support and Resources
Generate Support Bundle
The support bundle collects comprehensive diagnostics for troubleshooting:
# Install the support-bundle plugin (if not already installed)
kubectl krew install support-bundle
# Generate support bundle (automatically detects cluster specs)
kubectl support-bundle --load-cluster-specs
# Support bundle includes:
# - All pod logs
# - Resource configurations
# - Cluster state
# - Event history
# - Secret names (not values)
The support bundle is saved as a .tar.gz file. Share it with CrewAI support for faster issue resolution.
Quick Diagnostic Commands
# Check all CrewAI resources
kubectl get all -l app.kubernetes.io/name=crewai-platform
# View recent events
kubectl get events --sort-by='.lastTimestamp'
# Component logs
kubectl logs -l app.kubernetes.io/component=web --tail=100 -f
kubectl logs -l app.kubernetes.io/component=worker --tail=100 -f
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100
# Check resource usage
kubectl top nodes
kubectl top pods
# Describe problematic pod
kubectl describe pod POD_NAME
For assistance with CrewAI Platform:
- Customer Portal: https://enterprise.crewai.com/crewai
- Support Team: Contact your CrewAI representative
- Emergency Issues: Generate and share support bundle with your support team
- Release History: https://enterprise.crewai.com/crewai/release-history