Overview

This guide provides solutions for common issues encountered when deploying and operating CrewAI Platform on Kubernetes. For additional support, generate a support bundle and contact your CrewAI representative.

Diagnostic Commands

Before troubleshooting specific issues, gather diagnostic information:
# Check all CrewAI Platform resources
kubectl get all -l app.kubernetes.io/name=crewai-platform

# View the 20 most recent events (output is sorted oldest to newest)
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Check pod status
kubectl get pods -o wide

# View logs for specific component
kubectl logs -l app.kubernetes.io/component=web --tail=100
kubectl logs -l app.kubernetes.io/component=worker --tail=100

# Check resource usage
kubectl top nodes
kubectl top pods

Common Issues

Pod CrashLoopBackOff

Symptoms:
kubectl get pods
# NAME                    READY   STATUS             RESTARTS   AGE
# crewai-web-xxx          0/1     CrashLoopBackOff   5          3m
Common Causes:
  1. Missing required secrets
  2. Database connection failure
  3. Resource limits too restrictive
  4. Invalid configuration
Diagnosis:
kubectl logs POD_NAME
kubectl describe pod POD_NAME
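
Because a crash-looping container is restarted repeatedly, plain kubectl logs may show only the attempt that is currently starting. The following is a minimal sketch, assuming the crashing container is the first one in the pod; the crash_diag helper name is illustrative and not part of CrewAI Platform. It prints the recorded termination reason and the logs of the previous attempt:

```shell
# Hypothetical helper: inspect the last failed run of a crash-looping pod.
# Assumes the pod's first container is the one that is crashing.
crash_diag() {
  pod="$1"
  # Reason and exit code recorded for the previous termination (e.g. Error, OOMKilled)
  kubectl get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" exit="}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
  # Logs from the previous container instance, not the one currently starting
  kubectl logs "$pod" --previous --tail=50
}

# Usage: crash_diag POD_NAME
```

A reason of OOMKilled points at memory limits, while a missing secret or a failed database connection usually surfaces in the previous logs instead.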

Pod Startup Failures

Diagnosis:
  1. Check pod logs for errors
  2. Verify resource limits are not too restrictive
  3. Check secret availability
  4. Verify image pull secrets are configured
# Check pod status and logs
kubectl get pods
kubectl describe pod POD_NAME
kubectl logs POD_NAME

Database Connection Issues

Symptoms: Logs show "could not connect to server: Connection refused"
Diagnosis: If the application cannot connect to the database:
  1. Verify DB_HOST is set correctly for external databases
  2. Check database credentials in secrets
  3. Ensure database allows connections from Kubernetes cluster
  4. Verify database name and port configuration
  5. Check security groups/firewall rules
# Check database connectivity from a pod
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h DB_HOST -U DB_USER -d POSTGRES_DB
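
To separate a network problem from a credentials problem, it can help to test reachability without logging in. The sketch below assumes the same postgres:15 image used above (which ships pg_isready) and an assumed default port of 5432; the db_reachable helper name is illustrative:

```shell
# Hypothetical helper: test whether the database port accepts connections,
# independent of credentials. Port 5432 is an assumed default.
db_reachable() {
  host="$1"
  port="${2:-5432}"
  kubectl run -it --rm db-ready --image=postgres:15 --restart=Never -- \
    pg_isready -h "$host" -p "$port"
}

# Usage: db_reachable DB_HOST 5432
```

If pg_isready reports that the server is accepting connections but psql still fails, the cause is more likely credentials or the database name than firewall rules.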

Storage/S3 Connection Issues

Diagnosis:
  1. Verify AWS credentials are correct
  2. Check bucket name and region
  3. Ensure IAM permissions allow bucket access
  4. For MinIO, verify endpoint URL is accessible
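
The checks above can be exercised from a workstation with the AWS CLI. This is a sketch only: it assumes the AWS CLI is installed and that the same credentials the platform uses are exported in your shell. The s3_check helper name is hypothetical, and the bucket, region, and endpoint are placeholders:

```shell
# Hypothetical helper: list a bucket with the credentials in the current shell.
# Pass a third argument (endpoint URL) for MinIO or other S3-compatible stores.
s3_check() {
  bucket="$1"
  region="$2"
  endpoint="${3:-}"
  if [ -n "$endpoint" ]; then
    aws s3 ls "s3://$bucket" --region "$region" --endpoint-url "$endpoint"
  else
    aws s3 ls "s3://$bucket" --region "$region"
  fi
}

# Usage: s3_check BUCKET_NAME REGION
# Usage: s3_check BUCKET_NAME REGION https://MINIO_ENDPOINT
```

An access-denied error here suggests IAM permissions, while a timeout or DNS failure suggests the endpoint or network path.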

Image Pull Errors

Symptoms: ErrImagePull, ImagePullBackOff
Solutions:
  1. Verify image name and tag exist
  2. Check pull secret configuration
  3. Verify registry credentials
  4. Check network connectivity to registry
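
The steps above can be sketched as a small helper. It is illustrative only (image_pull_diag is not a CrewAI command) and uses standard kubectl queries to show which pull secrets the pod references and what the kubelet reported:

```shell
# Hypothetical helper: show which pull secrets a pod references and the
# events recorded for it.
image_pull_diag() {
  pod="$1"
  # Names of the imagePullSecrets attached to the pod spec
  kubectl get pod "$pod" -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
  # Events for this pod only; pull failures include the registry's error text
  kubectl get events --field-selector "involvedObject.name=$pod" --sort-by='.lastTimestamp'
}

# Usage: image_pull_diag POD_NAME
```

An empty first line means no pull secret is attached to the pod, which is a common cause of ImagePullBackOff with private registries.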

Ingress Not Accessible

Symptoms: Cannot reach application via ingress hostname
Diagnosis:
# Check ingress status
kubectl get ingress
kubectl describe ingress crewai-ingress
Solutions:
  1. Verify ingress controller is installed
  2. Check ingress className matches your controller
  3. Verify DNS points to ingress load balancer
  4. Check TLS certificate configuration
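
A minimal sketch of the first three checks, assuming the ingress is named crewai-ingress as in the commands above (the ingress_diag helper name is hypothetical):

```shell
# Hypothetical helper: compare the ingress class and show the assigned address.
ingress_diag() {
  name="${1:-crewai-ingress}"
  # Class requested by the ingress
  kubectl get ingress "$name" -o jsonpath='{.spec.ingressClassName}{"\n"}'
  # Classes actually installed in the cluster
  kubectl get ingressclass
  # Address assigned by the controller; DNS should resolve to this value
  kubectl get ingress "$name" \
    -o jsonpath='{.status.loadBalancer.ingress[*].hostname}{.status.loadBalancer.ingress[*].ip}{"\n"}'
}

# Usage: ingress_diag crewai-ingress
```

If the requested class is not in the installed list, or the address line is empty, the controller has likely not picked up the ingress.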

Out of Memory (OOMKilled)

Symptoms: Pods restarting with OOMKilled
Solutions:
  1. Increase memory limits
  2. Tune WEB_CONCURRENCY and RAILS_MAX_THREADS
  3. Monitor actual memory usage
  4. Check for memory leaks
# Check memory usage patterns
kubectl top pods --sort-by=memory

# Increase memory limits in values.yaml
web:
  resources:
    limits:
      memory: "16Gi"  # Increase from 12Gi default
    requests:
      memory: "8Gi"

# Reduce concurrency to lower memory usage
envVars:
  WEB_CONCURRENCY: 2  # Reduce worker processes
  RAILS_MAX_THREADS: 3  # Reduce threads per process
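
Before raising limits, it is worth confirming that restarts are caused by the memory limit rather than by an application error. The following sketch (oom_report is an illustrative name) lists only pods whose last container termination was recorded as OOMKilled:

```shell
# Hypothetical helper: list pods whose last termination reason was OOMKilled.
oom_report() {
  kubectl get pods \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,LAST_REASON:.status.containerStatuses[*].lastState.terminated.reason' \
    | grep -E '^NAME|OOMKilled'
}

# Usage: oom_report
```

If only the header row is printed, no pod in the current namespace has a recorded OOMKilled termination.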

BuildKit Build Failures

Symptoms: Crew builds fail, BuildKit errors in logs
Common Causes:
  1. Registry authentication issues
  2. Insufficient BuildKit resources
  3. Network connectivity problems
Solutions:
# Check BuildKit pod status
kubectl get pods -l app.kubernetes.io/component=buildkit

# View BuildKit logs
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100

# Verify registry authentication
kubectl get secret docker-registry -o yaml

# Increase BuildKit resources if needed
buildkit:
  resources:
    limits:
      cpu: "8"
      memory: "16Gi"

# Configure insecure registry if using internal registry
buildkit:
  registries:
    - hostname: "registry.internal.company.com"
      insecure: true
      http: true

Persistent Volume Issues

Symptoms: Pods stuck in Pending state, PVC not binding
Solutions:
# Check PVC status
kubectl get pvc

# Describe PVC for events
kubectl describe pvc crewai-postgres-data-crewai-postgres-0

# Check StorageClass availability
kubectl get storageclass

# Check if default StorageClass is set
kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}'

# Set default StorageClass if needed
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
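
When several claims exist, a short filter makes the unbound ones easier to spot. This is a sketch under the assumption that the default kubectl get pvc column order (name, then status) applies; pvc_diag is an illustrative name:

```shell
# Hypothetical helper: print PVCs that are not Bound, then recent PVC events.
pvc_diag() {
  # Column 1 is the claim name, column 2 its status
  kubectl get pvc --no-headers | awk '$2 != "Bound" {print $1, $2}'
  # Provisioning errors are recorded as events on the claim
  kubectl get events --field-selector involvedObject.kind=PersistentVolumeClaim \
    --sort-by='.lastTimestamp'
}

# Usage: pvc_diag
```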

Authentication Provider Issues

Symptoms: Unable to log in, OAuth errors
Solutions:

Entra ID Issues

# Verify secrets are set
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_CLIENT_ID}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_TENANT_ID}' | base64 -d

# Check redirect URI configuration
# Must match: https://YOUR_HOST/auth/entra_id/callback

# Verify APPLICATION_HOST matches your domain
kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST

Okta Issues

# Verify Okta configuration
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_SITE}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_CLIENT_ID}' | base64 -d

# Check redirect URI in Okta
# Must match: https://YOUR_HOST/auth/okta/callback
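
Redirect URI mismatches are easier to catch when the expected value is derived from the running configuration rather than typed by hand. A minimal sketch, assuming APPLICATION_HOST holds a bare hostname without a scheme (expected_redirect_uri is a hypothetical helper):

```shell
# Hypothetical helper: build the callback URL the identity provider must have registered.
# provider is "entra_id" or "okta", matching the callback paths listed above.
expected_redirect_uri() {
  host="$1"
  provider="$2"
  printf 'https://%s/auth/%s/callback\n' "$host" "$provider"
}

# Usage (reads the host from the running deployment):
# expected_redirect_uri "$(kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST)" okta
```

Compare the printed URL character by character with the value registered in Entra ID or Okta, since providers typically require an exact match.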

Secret Management Issues

Symptoms: Pods fail to start, missing secret errors
Solutions:
# Check if secrets exist
kubectl get secret crewai-secrets

# List the secret's keys and value sizes (describe does not print the values)
kubectl describe secret crewai-secrets

# For external secret store issues
kubectl get secretstore
kubectl get externalsecret
kubectl describe externalsecret crewai-external-secret

# Check External Secrets Operator logs
kubectl logs -n external-secrets-operator deployment/external-secrets
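
To check that specific keys are present without printing their values, something like the following may help. It is a sketch: secret_has_keys is a hypothetical helper, and the keys you pass depend on your own configuration.

```shell
# Hypothetical helper: report which of the given keys are missing from a secret.
# Prints key names only, never values.
secret_has_keys() {
  secret="$1"
  shift
  # One key name per line
  keys="$(kubectl get secret "$secret" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}')" || return 1
  missing=0
  for key in "$@"; do
    if ! printf '%s\n' "$keys" | grep -qx -- "$key"; then
      echo "missing key: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: secret_has_keys crewai-secrets OKTA_SITE OKTA_CLIENT_ID
```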

Configuration Warnings

RAILS_MASTER_KEY Warning

Warning Message:
⚠️  WARNING: RAILS_MASTER_KEY detected in your configuration.
    This key is automatically managed by the chart and should not be set manually.

    Please remove RAILS_MASTER_KEY from your values.
Cause: You have manually configured RAILS_MASTER_KEY in either envVars or secrets in your values file.
Solution: The chart automatically manages RAILS_MASTER_KEY and does not require manual configuration. Remove this setting from your values file:
# Remove these lines from your values.yaml:
envVars:
  RAILS_MASTER_KEY: "..."  # Remove this

secrets:
  RAILS_MASTER_KEY: "..."  # Remove this
Then upgrade your deployment:
helm upgrade crewai-platform \
  oci://registry.crewai.com/crewai/stable/crewai-platform \
  --values my-values.yaml
Why This Matters: The chart uses a different Rails configuration approach that doesn’t require RAILS_MASTER_KEY. Setting it manually can cause configuration conflicts.
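
A quick pre-upgrade check can confirm the key is gone. The sketch below assumes the map-style envVars and secrets layout shown above; check_no_master_key is an illustrative name, and the file name is a placeholder:

```shell
# Hypothetical pre-upgrade check: fail if the values file still sets RAILS_MASTER_KEY.
check_no_master_key() {
  values_file="$1"
  # Match the key at any indentation; commented-out lines do not match
  if grep -nE '^[[:space:]]*RAILS_MASTER_KEY[[:space:]]*:' "$values_file"; then
    echo "RAILS_MASTER_KEY is still set in $values_file" >&2
    return 1
  fi
  echo "OK: RAILS_MASTER_KEY not set in $values_file"
}

# Usage: check_no_master_key my-values.yaml
```

On failure, the matching lines are printed with their line numbers so they can be removed directly.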

Performance Issues

Symptoms: Slow response times, high latency
Diagnostic Steps:
# Check resource utilization
kubectl top nodes
kubectl top pods

# Review slow query logs (if configured)
kubectl logs -l app.kubernetes.io/component=web | grep "Slow query"

Solutions:
  1. Scale web replicas:
web:
  replicaCount: 4  # Increase from 2
  2. Increase resources:
web:
  resources:
    limits:
      cpu: "8"
      memory: "16Gi"
    requests:
      cpu: "2000m"
      memory: "8Gi"
  3. Tune concurrency:
envVars:
  WEB_CONCURRENCY: 4
  RAILS_MAX_THREADS: 10
  4. Add database read replicas (configure in external database)
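
Changes to values.yaml take effect only after an upgrade. A minimal sketch of applying them and confirming the rollout, reusing the helm upgrade command and the crewai-web deployment name that appear elsewhere in this guide (apply_and_verify is a hypothetical helper, and the five-minute timeout is an assumption):

```shell
# Hypothetical helper: apply updated values and wait for the web rollout.
apply_and_verify() {
  values_file="${1:-my-values.yaml}"
  helm upgrade crewai-platform \
    oci://registry.crewai.com/crewai/stable/crewai-platform \
    --values "$values_file" || return 1
  # Block until the new web pods are ready or the timeout expires
  kubectl rollout status deploy/crewai-web --timeout=5m
  # Show the resulting web pods
  kubectl get pods -l app.kubernetes.io/component=web
}

# Usage: apply_and_verify my-values.yaml
```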

Support and Resources

Generate Support Bundle

The support bundle collects comprehensive diagnostics for troubleshooting:
# Install the support-bundle plugin (if not already installed)
kubectl krew install support-bundle

# Generate support bundle (automatically detects cluster specs)
kubectl support-bundle --load-cluster-specs

# Support bundle includes:
# - All pod logs
# - Resource configurations
# - Cluster state
# - Event history
# - Secret names (not values)
The support bundle is saved as a .tar.gz file. Share it with CrewAI support for faster issue resolution.

Quick Diagnostic Commands

# Check all CrewAI resources
kubectl get all -l app.kubernetes.io/name=crewai-platform

# View recent events
kubectl get events --sort-by='.lastTimestamp'

# Component logs
kubectl logs -l app.kubernetes.io/component=web --tail=100 -f
kubectl logs -l app.kubernetes.io/component=worker --tail=100 -f
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100

# Check resource usage
kubectl top nodes
kubectl top pods

# Describe problematic pod
kubectl describe pod POD_NAME

Contact Support

For assistance with CrewAI Platform:
  • Customer Portal: https://enterprise.crewai.com/crewai
  • Support Team: Contact your CrewAI representative
  • Emergency Issues: Generate and share support bundle with your support team
  • Release History: https://enterprise.crewai.com/crewai/release-history