Overview
This guide provides solutions for common issues encountered when deploying and operating CrewAI Platform on Kubernetes. For additional support, generate a support bundle and contact your CrewAI representative.
Diagnostic Commands
Before troubleshooting specific issues, gather diagnostic information:
# Check all CrewAI Platform resources
kubectl get all -l app.kubernetes.io/name=crewai-platform
# View recent events
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Check pod status
kubectl get pods -o wide
# View logs for specific component
kubectl logs -l app.kubernetes.io/component=web --tail=100
kubectl logs -l app.kubernetes.io/component=worker --tail=100
# Check resource usage
kubectl top nodes
kubectl top pods
Common Issues
Pod CrashLoopBackOff
Symptoms:
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# crewai-web-xxx 0/1 CrashLoopBackOff 5 3m
Common Causes:
- Missing required secrets
- Database connection failure
- Resource limits too restrictive
- Invalid configuration
Diagnosis:
kubectl logs POD_NAME
kubectl describe pod POD_NAME
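In a crash loop the failing container has usually been replaced by the time you look, so the current logs may be empty or unhelpful. The following is a minimal sketch, not part of the chart: the helper name crash_summary is hypothetical, while the --previous flag and the lastState.terminated status fields are standard kubectl and Kubernetes features.

```shell
# Hypothetical helper: summarize why a pod's containers last terminated.
# Usage: crash_summary POD_NAME
crash_summary() {
  local pod="$1"
  # Container name, termination reason, and exit code of the previous instance
  kubectl get pod "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  # Last 50 log lines written by the instance that crashed
  kubectl logs "$pod" --previous --tail=50
}
```

A reason of OOMKilled points at memory limits; a reason of Error with an application stack trace in the previous logs more often points at secrets, configuration, or the database.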
Pod Startup Failures
Diagnosis:
- Check pod logs for errors
- Verify resource limits are not too restrictive
- Check secret availability
- Verify image pull secrets are configured
# Check pod status and logs
kubectl get pods
kubectl describe pod POD_NAME
kubectl logs POD_NAME
Database Connection Issues
Symptoms: Logs show "could not connect to server: Connection refused"
Diagnosis:
If the application cannot connect to the database:
- Verify DB_HOST is set correctly for external databases
- Check database credentials in secrets
- Ensure database allows connections from Kubernetes cluster
- Verify database name and port configuration
- Check security groups/firewall rules
- If using Built-in Integrations (oauth.enabled: true) with an external database, ensure the OAuth database exists (default: oauth_db)
- If using Wharf (wharf.enabled: true) with an external database, ensure the Wharf database exists (default: wharf)
# Check database connectivity from a pod
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
psql -h DB_HOST -U DB_USER -d POSTGRES_DB
# Verify OAuth database exists (if oauth.enabled: true)
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
psql -h DB_HOST -U DB_USER -d oauth_db -c '\l'
# Verify Wharf database exists (if wharf.enabled: true)
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
psql -h DB_HOST -U DB_USER -d wharf -c '\l'
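Before testing credentials, it can help to confirm that the database port is reachable from inside the cluster at all, which separates firewall problems from authentication problems. A sketch under stated assumptions: the helper name db_reachable, the pod name db-check, and the default port 5432 are illustrative; pg_isready is the standard PostgreSQL client utility included in the postgres:15 image.

```shell
# Hypothetical helper: check whether PostgreSQL accepts connections from the cluster.
# Usage: db_reachable DB_HOST [PORT]
db_reachable() {
  local host="$1" port="${2:-5432}"
  # pg_isready needs no credentials; it prints "accepting connections",
  # "rejecting connections", or "no response" for the given host and port
  kubectl run db-check --rm -i --restart=Never --image=postgres:15 -- \
    pg_isready -h "$host" -p "$port"
}
```

If this reports "no response", look at security groups, firewall rules, and DB_HOST before revisiting the credentials in your secrets.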
Storage/S3 Connection Issues
Diagnosis:
- Verify AWS credentials are correct
- Check bucket name and region
- Ensure IAM permissions allow bucket access
- For MinIO, verify endpoint URL is accessible
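The checks above can be approximated with the AWS CLI. This sketch assumes the CLI is installed where you run it and is configured with the same credentials the platform uses; the helper name s3_check is hypothetical, and the bucket, region, and endpoint values are placeholders.

```shell
# Hypothetical helper: list a bucket to confirm credentials, region, and endpoint.
# Usage: s3_check BUCKET REGION [ENDPOINT_URL]
s3_check() {
  local bucket="$1" region="$2" endpoint="${3:-}"
  if [ -n "$endpoint" ]; then
    # MinIO or another S3-compatible store: point the CLI at its endpoint
    aws s3 ls "s3://$bucket" --region "$region" --endpoint-url "$endpoint"
  else
    aws s3 ls "s3://$bucket" --region "$region"
  fi
}
```

An AccessDenied error usually indicates IAM permissions, while a connection error or timeout against a MinIO endpoint usually indicates that the endpoint URL is not reachable from where the command runs.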
Image Pull Errors
Symptoms: ErrImagePull, ImagePullBackOff
Solutions:
- Verify image name and tag exist
- Check pull secret configuration
- Verify registry credentials
- Check network connectivity to registry
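The first three checks can be gathered in one pass for a failing pod. A minimal sketch: the helper name pull_debug is hypothetical, and the fields and flags it uses are standard kubectl.

```shell
# Hypothetical helper: show pull secrets, image references, and events for a pod.
# Usage: pull_debug POD_NAME
pull_debug() {
  local pod="$1"
  # Names of the pull secrets attached to the pod spec (blank if none)
  kubectl get pod "$pod" -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
  # Exact image references the pod is trying to pull
  kubectl get pod "$pod" -o jsonpath='{range .spec.containers[*]}{.image}{"\n"}{end}'
  # Events for this pod, which carry the registry's error message
  kubectl get events --field-selector "involvedObject.name=$pod" --sort-by='.lastTimestamp'
}
```

The event message generally distinguishes a missing tag ("not found") from a credentials problem ("unauthorized") or a network problem (timeouts, DNS failures).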
Ingress Not Accessible
Symptoms: Cannot reach application via ingress hostname
Diagnosis:
# Check ingress status
kubectl get ingress
kubectl describe ingress crewai-ingress
Solutions:
- Verify ingress controller is installed
- Check ingress className matches your controller
- Verify DNS points to ingress load balancer
- Check TLS certificate configuration
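The class and DNS checks above can be sketched as follows. The helper name ingress_check is hypothetical, and the sketch assumes nslookup is available on your workstation; substitute dig or another resolver tool if it is not.

```shell
# Hypothetical helper: compare the ingress class and address with what is installed and in DNS.
# Usage: ingress_check INGRESS_NAME HOSTNAME
ingress_check() {
  local ing="$1" host="$2"
  # Ingress classes provided by the installed controllers
  kubectl get ingressclass
  # Class requested by the ingress resource; it should match one of the above
  kubectl get ingress "$ing" -o jsonpath='{.spec.ingressClassName}{"\n"}'
  # Address assigned by the controller (blank if the controller has not picked it up)
  kubectl get ingress "$ing" \
    -o jsonpath='{.status.loadBalancer.ingress[*].hostname}{.status.loadBalancer.ingress[*].ip}{"\n"}'
  # What DNS currently returns for the hostname; it should resolve to that address
  nslookup "$host"
}
```

For example, ingress_check crewai-ingress YOUR_HOST prints the four results in order so they can be compared side by side.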
Out of Memory (OOMKilled)
Symptoms: Pods restarting with OOMKilled
Solutions:
- Increase memory limits
- Tune WEB_CONCURRENCY and RAILS_MAX_THREADS
- Monitor actual memory usage
- Check for memory leaks
# Check memory usage patterns
kubectl top pods --sort-by=memory
# Increase memory limits in values.yaml
web:
  resources:
    limits:
      memory: "16Gi" # Increase from 12Gi default
    requests:
      memory: "8Gi"
# Reduce concurrency to lower memory usage
envVars:
  WEB_CONCURRENCY: 2 # Reduce worker processes
  RAILS_MAX_THREADS: 3 # Reduce threads per process
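To confirm which pods were actually killed for exceeding their memory limit, as opposed to restarting for another reason, you can read the last termination reason from the pod status. A minimal sketch; the helper name oom_pods is hypothetical.

```shell
# Hypothetical helper: list pods with a container whose last termination was OOMKilled.
oom_pods() {
  kubectl get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /OOMKilled/ {print $1}'
}
```

Pods that appear in this list are candidates for higher memory limits or lower concurrency; pods that restart without appearing here are failing for a different reason.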
BuildKit Build Failures
Symptoms: Crew builds fail, BuildKit errors in logs
Common Causes:
- Registry authentication issues
- Insufficient BuildKit resources
- Network connectivity problems
Solutions:
# Check BuildKit pod status
kubectl get pods -l app.kubernetes.io/component=buildkit
# View BuildKit logs
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100
# Verify registry authentication
kubectl get secret docker-registry -o yaml
# Increase BuildKit resources if needed
buildkit:
  resources:
    limits:
      cpu: "8"
      memory: "16Gi"
# Configure insecure registry if using internal registry
buildkit:
  registries:
    - hostname: "registry.internal.company.com"
      insecure: true
      http: true
Persistent Volume Issues
Symptoms: Pods stuck in Pending state, PVC not binding
Solutions:
# Check PVC status
kubectl get pvc
# Describe PVC for events
kubectl describe pvc crewai-postgres-data-crewai-postgres-0
# Check StorageClass availability
kubectl get storageclass
# Check if default StorageClass is set
kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}'
# Set default StorageClass if needed
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Authentication Provider Issues
Symptoms: Unable to log in, OAuth errors
Solutions:
Entra ID Issues
# Verify secrets are set
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_CLIENT_ID}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_TENANT_ID}' | base64 -d
# Check redirect URI configuration
# Must match: https://YOUR_HOST/auth/entra_id/callback
# Verify APPLICATION_HOST matches your domain
kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST
Okta Issues
# Verify Okta configuration
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_SITE}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_CLIENT_ID}' | base64 -d
# Check redirect URI in Okta
# Must match: https://YOUR_HOST/auth/okta/callback
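Redirect URI mismatches are easier to spot when the expected value is built from the host the application is actually configured with. A small sketch, assuming APPLICATION_HOST holds a bare hostname without a scheme; the helper name expected_callback is hypothetical, and the URL pattern is the one stated above.

```shell
# Hypothetical helper: build the redirect URI that must be registered with the provider.
# Usage: expected_callback PROVIDER HOST   (PROVIDER is entra_id or okta)
expected_callback() {
  local provider="$1" host="$2"
  printf 'https://%s/auth/%s/callback\n' "$host" "$provider"
}

# Example (requires cluster access):
#   expected_callback okta "$(kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST)"
```

Compare the printed value character for character with the redirect URI registered in Entra ID or Okta, including the scheme and any trailing slash.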
Secret Management Issues
Symptoms: Pods fail to start, missing secret errors
Solutions:
# Check if secrets exist
kubectl get secret crewai-secrets
# List the keys stored in the secret (describe shows key names and sizes, not values)
kubectl describe secret crewai-secrets
# For external secret store issues
kubectl get secretstore
kubectl get externalsecret
kubectl describe externalsecret crewai-external-secret
# Check External Secrets Operator logs
kubectl logs -n external-secrets-operator deployment/external-secrets
Expected Behavior: Pod Restarts After Secret Updates
Observation: Pods restart after running helm upgrade when secret values change
This is expected behavior. When you update secret values in your Helm values file (e.g., rotating credentials), the chart automatically triggers a rolling restart of all affected pods to ensure they pick up the new credentials. This is by design and prevents stale credentials from being used.
What to expect:
- Pods restart in a rolling fashion (no downtime)
- Each pod restarts once to load new secret values
- The restart happens automatically - no manual pod deletion needed
To verify the restart was successful:
# Check that all pods are running with new secret values
kubectl get pods -l app.kubernetes.io/name=crewai-platform
# View pod age to confirm restart
kubectl get pods -o wide
# Check logs to verify application started correctly
kubectl logs -l app.kubernetes.io/component=web --tail=50
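Rather than watching pod ages, you can wait for the rolling restart to finish. This sketch assumes the chart's Deployments carry the app.kubernetes.io/name=crewai-platform label used elsewhere in this guide; the helper name wait_for_rollout and the default timeout are hypothetical.

```shell
# Hypothetical helper: wait for every labeled Deployment to finish rolling out.
# Usage: wait_for_rollout [TIMEOUT]   (default 300s)
wait_for_rollout() {
  local timeout="${1:-300s}" name
  for name in $(kubectl get deployments -l app.kubernetes.io/name=crewai-platform -o name); do
    # Blocks until the rollout completes, or fails once the timeout expires
    kubectl rollout status "$name" --timeout="$timeout" || return 1
  done
}
```

A non-zero return means at least one Deployment did not become ready in time, which is the point to check pod logs and events.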
Configuration Warnings
RAILS_MASTER_KEY Warning
Warning Message:
⚠️ WARNING: RAILS_MASTER_KEY detected in your configuration.
This key is automatically managed by the chart and should not be set manually.
Please remove RAILS_MASTER_KEY from your values.
Cause: You have manually configured RAILS_MASTER_KEY in either envVars or secrets in your values file.
Solution:
The chart automatically manages RAILS_MASTER_KEY and does not require manual configuration. Remove this setting from your values file:
# Remove these lines from your values.yaml:
envVars:
  RAILS_MASTER_KEY: "..." # Remove this
secrets:
  RAILS_MASTER_KEY: "..." # Remove this
Then upgrade your deployment:
helm upgrade crewai-platform \
oci://registry.crewai.com/crewai/stable/crewai-platform \
--values my-values.yaml
Why This Matters: The chart uses a different Rails configuration approach that doesn’t require RAILS_MASTER_KEY. Setting it manually can cause configuration conflicts.
Performance Issues
Symptoms: Slow response times, high latency
Diagnostic Steps:
# Check resource utilization
kubectl top nodes
kubectl top pods
# Review slow query logs (if configured)
kubectl logs -l app.kubernetes.io/component=web --tail=1000 | grep "Slow query"
Solutions:
- Scale web replicas:
  web:
    replicaCount: 4 # Increase from 2
- Increase resources:
  web:
    resources:
      limits:
        cpu: "8"
        memory: "16Gi"
      requests:
        cpu: "2000m"
        memory: "8Gi"
- Tune concurrency:
  envVars:
    WEB_CONCURRENCY: 4
    RAILS_MAX_THREADS: 10
- Add database read replicas (configure in external database)
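Scaling replicas and raising concurrency both increase the number of database connections the web tier can open. As a rough sizing aid, assuming each thread may hold one connection at a time (an assumption, not something the chart documents), the upper bound is replicas × WEB_CONCURRENCY × RAILS_MAX_THREADS. The helper name below is hypothetical.

```shell
# Hypothetical helper: upper bound on database connections from the web tier,
# assuming one connection per thread.
# Usage: max_web_db_connections REPLICAS WEB_CONCURRENCY RAILS_MAX_THREADS
max_web_db_connections() {
  echo $(( $1 * $2 * $3 ))
}

max_web_db_connections 4 4 10   # prints 160
```

With the example values above (4 replicas, WEB_CONCURRENCY 4, RAILS_MAX_THREADS 10) the bound is 160, before counting worker pods. Compare it against the database's configured connection limit when tuning.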
Support and Resources
Generate Support Bundle
The support bundle collects comprehensive diagnostics for troubleshooting:
# Install the support-bundle plugin (if not already installed)
kubectl krew install support-bundle
# Generate support bundle (automatically detects cluster specs)
kubectl support-bundle --load-cluster-specs
# Support bundle includes:
# - All pod logs
# - Resource configurations
# - Cluster state
# - Event history
# - Secret names (not values)
The support bundle is saved as a .tar.gz file. Share it with CrewAI support for faster issue resolution.
Quick Diagnostic Commands
# Check all CrewAI resources
kubectl get all -l app.kubernetes.io/name=crewai-platform
# View recent events
kubectl get events --sort-by='.lastTimestamp'
# Component logs
kubectl logs -l app.kubernetes.io/component=web --tail=100 -f
kubectl logs -l app.kubernetes.io/component=worker --tail=100 -f
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100
# Check resource usage
kubectl top nodes
kubectl top pods
# Describe problematic pod
kubectl describe pod <POD_NAME>
For assistance with CrewAI Platform:
- Customer Portal: https://enterprise.crewai.com/crewai
- Support Team: Contact your CrewAI representative
- Emergency Issues: Generate and share support bundle with your support team
- Release History: https://enterprise.crewai.com/crewai/release-history