Troubleshooting

Overview

This guide provides solutions for common issues encountered when deploying and operating CrewAI Platform on Kubernetes. For additional support, generate a support bundle and contact your CrewAI representative.

Diagnostic Commands

Before troubleshooting specific issues, gather diagnostic information:

# Check all CrewAI Platform resources
kubectl get all -l app.kubernetes.io/name=crewai-platform

# View recent events
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Check pod status
kubectl get pods -o wide

# View logs for specific component
kubectl logs -l app.kubernetes.io/component=web --tail=100
kubectl logs -l app.kubernetes.io/component=worker --tail=100

# Check resource usage
kubectl top nodes
kubectl top pods

Common Issues

Pod CrashLoopBackOff

Symptoms:

kubectl get pods
# NAME                    READY   STATUS             RESTARTS   AGE
# crewai-web-xxx          0/1     CrashLoopBackOff   5          3m

Common Causes:

Missing required secrets
Database connection failure
Resource limits too restrictive
Invalid configuration

Diagnosis:

kubectl logs POD_NAME
kubectl describe pod POD_NAME
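
For a crash-looping pod, the logs of the freshly restarted container are often empty, so the previous container instance and its termination reason tend to be more useful. The following is a minimal sketch, assuming a single-container pod in the current namespace; crash_report is a hypothetical helper name, not something the chart provides.

```shell
# crash_report POD_NAME: hypothetical helper, not part of the CrewAI chart.
# Prints why each container last terminated, then the logs of the
# previous (crashed) container instance.
crash_report() {
  pod="$1"
  # Container name, last termination reason (e.g. Error, OOMKilled), exit code
  kubectl get pod "$pod" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  # Logs from the instance that crashed, not the one that just restarted
  kubectl logs "$pod" --previous --tail=50
}

# Usage: crash_report POD_NAME
```

For pods with more than one container, add -c CONTAINER_NAME to the kubectl logs call.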

Pod Startup Failures

Diagnosis:

Check pod logs for errors
Verify resource limits are not too restrictive
Check secret availability
Verify image pull secrets are configured

# Check pod status and logs
kubectl get pods
kubectl describe pod POD_NAME
kubectl logs POD_NAME

Database Connection Issues

Symptoms: Logs show "could not connect to server: Connection refused"

Diagnosis: If the application cannot connect to the database:

Verify DB_HOST is set correctly for external databases
Check database credentials in secrets
Ensure database allows connections from Kubernetes cluster
Verify database name and port configuration
Check security groups/firewall rules
If using Built-in Integrations (oauth.enabled: true) with an external database, ensure the OAuth database exists (default: oauth_db)

# Check database connectivity from a pod
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h DB_HOST -U DB_USER -d POSTGRES_DB

# Verify OAuth database exists (if oauth.enabled: true)
kubectl run -it --rm debug --image=postgres:15 --restart=Never -- \
  psql -h DB_HOST -U DB_USER -d oauth_db -c '\l'

Storage/S3 Connection Issues

Diagnosis:

Verify AWS credentials are correct
Check bucket name and region
Ensure IAM permissions allow bucket access
For MinIO, verify endpoint URL is accessible
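
The checks above can be exercised from a workstation with the AWS CLI. This is a sketch under assumptions: the AWS CLI is installed, the environment holds the same credentials the platform uses, and s3_check is a hypothetical helper name. The bucket, region, and endpoint are placeholders you supply.

```shell
# s3_check BUCKET REGION [ENDPOINT_URL]: hypothetical helper.
# Lists the bucket using the credentials in the current environment.
# Pass ENDPOINT_URL for MinIO or another S3-compatible store.
s3_check() {
  bucket="$1"
  region="$2"
  endpoint="${3:-}"
  if [ -n "$endpoint" ]; then
    aws s3 ls "s3://$bucket" --region "$region" --endpoint-url "$endpoint"
  else
    aws s3 ls "s3://$bucket" --region "$region"
  fi
}

# Usage: s3_check BUCKET_NAME REGION
#        s3_check BUCKET_NAME REGION MINIO_ENDPOINT_URL
```

An access-denied error here points at IAM permissions, while a timeout or connection error points at the endpoint or network path.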

Image Pull Errors

Symptoms: ErrImagePull, ImagePullBackOff

Solutions:

Verify image name and tag exist
Check pull secret configuration
Verify registry credentials
Check network connectivity to registry
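
One way to gather the facts behind these checks is sketched below. It assumes the failing pod is in the current namespace; image_pull_info is a hypothetical helper name.

```shell
# image_pull_info POD_NAME: hypothetical helper.
image_pull_info() {
  pod="$1"
  # Images the pod is trying to pull
  kubectl get pod "$pod" -o jsonpath='{.spec.containers[*].image}{"\n"}'
  # Names of the pull secrets attached to the pod (empty line if none)
  kubectl get pod "$pod" -o jsonpath='{.spec.imagePullSecrets[*].name}{"\n"}'
  # Events for this pod, which carry the registry's error message
  kubectl get events --field-selector "involvedObject.name=$pod" --sort-by='.lastTimestamp'
}

# Usage: image_pull_info POD_NAME
```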

Ingress Not Accessible

Symptoms: Cannot reach application via ingress hostname

Diagnosis:

# Check ingress status
kubectl get ingress
kubectl describe ingress crewai-ingress

Solutions:

Verify ingress controller is installed
Check ingress className matches your controller
Verify DNS points to ingress load balancer
Check TLS certificate configuration
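
The solutions above can be checked roughly as follows. This is a sketch, assuming nslookup is available locally; ingress_check is a hypothetical helper name, and the ingress name and hostname are arguments you supply.

```shell
# ingress_check INGRESS_NAME HOSTNAME: hypothetical helper.
ingress_check() {
  name="$1"
  host="$2"
  # IngressClasses available in the cluster
  kubectl get ingressclass
  # Class requested by the ingress; should match one of the names above
  kubectl get ingress "$name" -o jsonpath='{.spec.ingressClassName}{"\n"}'
  # Address assigned by the controller (load balancer IP or hostname)
  kubectl get ingress "$name" -o jsonpath='{.status.loadBalancer.ingress[*].ip}{.status.loadBalancer.ingress[*].hostname}{"\n"}'
  # TLS secret names referenced by the ingress
  kubectl get ingress "$name" -o jsonpath='{.spec.tls[*].secretName}{"\n"}'
  # What DNS currently resolves to; compare with the address above
  nslookup "$host"
}

# Usage: ingress_check crewai-ingress YOUR_HOST
```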

Out of Memory (OOMKilled)

Symptoms: Pods restarting with OOMKilled

Solutions:

Increase memory limits
Tune WEB_CONCURRENCY and RAILS_MAX_THREADS
Monitor actual memory usage
Check for memory leaks

# Check memory usage patterns
kubectl top pods --sort-by=memory

# Increase memory limits in values.yaml
web:
  resources:
    limits:
      memory: "16Gi"  # Increase from 12Gi default
    requests:
      memory: "8Gi"

# Reduce concurrency to lower memory usage
envVars:
  WEB_CONCURRENCY: 2  # Reduce worker processes
  RAILS_MAX_THREADS: 3  # Reduce threads per process
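
Before raising limits, it can help to confirm which pods were actually killed for memory. The sketch below lists each pod's last termination reasons and keeps those that include OOMKilled. Both function names are hypothetical, and the current namespace is assumed.

```shell
# list_last_reasons: prints "POD<TAB>REASONS" for every pod in the current
# namespace, where REASONS are the last termination reasons of its containers.
list_last_reasons() {
  kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# only_oomkilled: reads that output on stdin and keeps the pod names whose
# reasons include OOMKilled.
only_oomkilled() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage: list_last_reasons | only_oomkilled
```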

BuildKit Build Failures

Symptoms: Crew builds fail, BuildKit errors in logs

Common Causes:

Registry authentication issues
Insufficient BuildKit resources
Network connectivity problems

Solutions:

# Check BuildKit pod status
kubectl get pods -l app.kubernetes.io/component=buildkit

# View BuildKit logs
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100

# Verify registry authentication
kubectl get secret docker-registry -o yaml

# Increase BuildKit resources if needed
buildkit:
  resources:
    limits:
      cpu: "8"
      memory: "16Gi"

# Configure insecure registry if using internal registry
buildkit:
  registries:
    - hostname: "registry.internal.company.com"
      insecure: true
      http: true

Persistent Volume Issues

Symptoms: Pods stuck in Pending state, PVC not binding

Solutions:

# Check PVC status
kubectl get pvc

# Describe PVC for events
kubectl describe pvc crewai-postgres-data-crewai-postgres-0

# Check StorageClass availability
kubectl get storageclass

# Check if default StorageClass is set
kubectl get storageclass -o jsonpath='{range .items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")]}{.metadata.name}{"\n"}{end}'

# Set default StorageClass if needed (gp2 is an example; use a name from the list above)
kubectl patch storageclass gp2 -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
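
When several claims exist, a small filter can pick out the ones that have not bound so they can be described one by one. This is a sketch that relies on STATUS being the second column of the default kubectl get pvc output; pending_pvcs is a hypothetical helper name.

```shell
# pending_pvcs: reads `kubectl get pvc` output on stdin and prints the names
# of claims whose STATUS column is Pending (the header row is skipped).
pending_pvcs() {
  awk 'NR > 1 && $2 == "Pending" { print $1 }'
}

# Usage:
# kubectl get pvc | pending_pvcs | while read -r pvc; do
#   kubectl describe pvc "$pvc"
# done
```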

Authentication Provider Issues

Symptoms: Unable to log in, OAuth errors

Solutions:

Entra ID Issues

# Verify secrets are set
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_CLIENT_ID}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.ENTRA_ID_TENANT_ID}' | base64 -d

# Check redirect URI configuration
# Must match: https://YOUR_HOST/auth/entra_id/callback

# Verify APPLICATION_HOST matches your domain
kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST

Okta Issues

# Verify Okta configuration
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_SITE}' | base64 -d
kubectl get secret crewai-secrets -o jsonpath='{.data.OKTA_CLIENT_ID}' | base64 -d

# Check redirect URI in Okta
# Must match: https://YOUR_HOST/auth/okta/callback
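
For both providers, the registered redirect URI has to match the pattern shown above exactly. The sketch below builds the expected value so it can be compared with what is configured in Entra ID or Okta. It assumes APPLICATION_HOST holds a bare hostname with no scheme; expected_callback is a hypothetical helper name.

```shell
# expected_callback HOST PROVIDER: hypothetical helper that prints the redirect
# URI the identity provider must have registered.
# PROVIDER is entra_id or okta, matching the callback paths in this guide.
expected_callback() {
  printf 'https://%s/auth/%s/callback\n' "$1" "$2"
}

# Usage (assumes APPLICATION_HOST is a bare hostname, e.g. crewai.example.com):
# host="$(kubectl exec deploy/crewai-web -- printenv APPLICATION_HOST)"
# expected_callback "$host" okta
```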

Secret Management Issues

Symptoms: Pods fail to start, missing secret errors

Solutions:

# Check if secrets exist
kubectl get secret crewai-secrets

# List secret keys and their sizes (describe does not print the values)
kubectl describe secret crewai-secrets

# For external secret store issues
kubectl get secretstore
kubectl get externalsecret
kubectl describe externalsecret crewai-external-secret

# Check External Secrets Operator logs
kubectl logs -n external-secrets-operator deployment/external-secrets
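
To see whether a particular key is missing without printing any secret values, the key names can be compared against the ones you expect. This is a sketch: both function names are hypothetical, and the keys in the usage line are only the ones mentioned elsewhere in this guide, not a complete list of what the chart requires.

```shell
# secret_keys NAME: prints the key names stored in a secret, never the values.
secret_keys() {
  kubectl get secret "$1" -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}

# missing_keys KEY...: reads present key names on stdin and prints every KEY
# argument that is not among them.
missing_keys() {
  present="$(cat)"
  for key in "$@"; do
    printf '%s\n' "$present" | grep -qxF "$key" || printf '%s\n' "$key"
  done
}

# Usage: secret_keys crewai-secrets | missing_keys OKTA_SITE OKTA_CLIENT_ID
```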

Configuration Warnings

RAILS_MASTER_KEY Warning

Warning Message:

⚠️  WARNING: RAILS_MASTER_KEY detected in your configuration.
    This key is automatically managed by the chart and should not be set manually.

    Please remove RAILS_MASTER_KEY from your values.

Cause: You have manually configured RAILS_MASTER_KEY in either envVars or secrets in your values file.

Solution: The chart automatically manages RAILS_MASTER_KEY and does not require manual configuration. Remove this setting from your values file:

# Remove these lines from your values.yaml:
envVars:
  RAILS_MASTER_KEY: "..."  # Remove this

secrets:
  RAILS_MASTER_KEY: "..."  # Remove this

Then upgrade your deployment:

helm upgrade crewai-platform \
  oci://registry.crewai.com/crewai/stable/crewai-platform \
  --values my-values.yaml

Why This Matters: The chart uses a different Rails configuration approach that doesn’t require RAILS_MASTER_KEY. Setting it manually can cause configuration conflicts.
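
A quick way to confirm the key is gone before upgrading is to search the values file for it. This is a minimal sketch; find_master_key is a hypothetical helper name, and my-values.yaml is the file name used in the upgrade command above.

```shell
# find_master_key FILE: hypothetical helper. Prints the numbered lines in a
# values file that still set RAILS_MASTER_KEY, ignoring comment-only lines.
# Returns non-zero when no such line remains.
find_master_key() {
  grep -n 'RAILS_MASTER_KEY' "$1" | grep -v '^[0-9]*:[[:space:]]*#'
}

# Usage: find_master_key my-values.yaml || echo "RAILS_MASTER_KEY not set"
```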

Performance Issues

Symptoms: Slow response times, high latency

Diagnostic Steps:

# Check resource utilization
kubectl top nodes
kubectl top pods

# Review slow query logs (if configured)
kubectl logs -l app.kubernetes.io/component=web | grep "Slow query"

Solutions:

Scale web replicas:

web:
  replicaCount: 4  # Increase from 2

Increase resources:

web:
  resources:
    limits:
      cpu: "8"
      memory: "16Gi"
    requests:
      cpu: "2000m"
      memory: "8Gi"

Tune concurrency:

envVars:
  WEB_CONCURRENCY: 4
  RAILS_MAX_THREADS: 10

Add database read replicas (configure in external database)

Support and Resources

Documentation

Installation Guide: Installation
Configuration Guide: Configuration
Configuration Reference: Reference

Generate Support Bundle

The support bundle collects comprehensive diagnostics for troubleshooting:

# Install the support-bundle plugin (if not already installed)
kubectl krew install support-bundle

# Generate support bundle (automatically detects cluster specs)
kubectl support-bundle --load-cluster-specs

# Support bundle includes:
# - All pod logs
# - Resource configurations
# - Cluster state
# - Event history
# - Secret names (not values)

The support bundle is saved as a .tar.gz file. Share it with CrewAI support for faster issue resolution.

Quick Diagnostic Commands

# Check all CrewAI resources
kubectl get all -l app.kubernetes.io/name=crewai-platform

# View recent events
kubectl get events --sort-by='.lastTimestamp'

# Component logs
kubectl logs -l app.kubernetes.io/component=web --tail=100 -f
kubectl logs -l app.kubernetes.io/component=worker --tail=100 -f
kubectl logs -l app.kubernetes.io/component=buildkit --tail=100

# Check resource usage
kubectl top nodes
kubectl top pods

# Describe problematic pod
kubectl describe pod POD_NAME

Contact Support

For assistance with CrewAI Platform:

Customer Portal: https://enterprise.crewai.com/crewai
Support Team: Contact your CrewAI representative
Emergency Issues: Generate and share support bundle with your support team
Release History: https://enterprise.crewai.com/crewai/release-history
