Overview

GET /health/debug is a comprehensive diagnostic endpoint exposed by every Factory deployment. Run it once after install and after any configuration change to confirm that the underlying components — database, authentication, Studio, tools, traces, OAuth, LLM connections, and platform pods — are all reachable and responsive. It is not designed for continuous polling. Use the lightweight GET /health endpoint (a cheap database ping) for liveness probes.
Available from Helm chart v0.6.11 onwards. Earlier installs do not expose /health/debug or the admin tab; upgrade the chart first.
An equivalent admin UI is available at /admin/factory_health for users with the factory-admin role. It renders the same data with status pills, latency badges, and per-deployment tables.

Availability

The /health/debug endpoint and matching admin page are only enabled on Kubernetes-based Factory deployments. Several of the underlying probes (in-cluster Service URLs, Deployment readiness checks, pod-level fingerprint comparison) only make sense on Kubernetes and would emit confidently wrong results on other provisioners.
Endpoint availability by provider:
  • Kubernetes (KOTS, plain Helm): Yes
  • EKS (Helm-managed): Yes
  • AWS ECS / Fargate: No (returns 404)
  • Local development: No
On unsupported providers the endpoint returns 404 Not Found and the admin tab redirects to the dashboard with an explanatory message. This is intentional — silent disablement is preferable to a green status pill that does not reflect the deployment topology.

What It Checks

Component-by-component, what the probe verifies:
  • database: Primary PostgreSQL connection is reachable and responsive.
  • studio: The Studio subsystem is healthy end-to-end across four layers: (1) core tables (Studio::Project, Studio::ProjectVersion, Studio::Execution) are queryable; (2) the Studio v2 HTTP surface (/studio/v2) is mounted in the router; (3) the internal “CrewAI” organization exists with both Assistant and Runner deployment records, all reporting Crew is Online; (4) the per-crew Kubernetes Deployments backing those records exist with at least one Ready pod each. Namespace resolution mirrors what the provisioner would do today: K8S_NAMESPACE when the k8s_namespace_isolation feature flag is off, <K8S_NAMESPACE>-org-<id> when it is on. See the legacy-install caveat in the studio crew(s) without Ready pods troubleshooting section below.
  • auth: The configured identity provider (Okta, Auth0, Entra ID, Keycloak, WorkOS, or local) is reachable via its OIDC discovery endpoint. No real login is attempted.
  • tools: The internal tool repository is provisioned and the PyPI-compatible tools index route is mounted (used by deployed crews to resolve tool packages).
  • wharf: The Wharf OTEL/traces backend is reachable from this pod. Skipped when Wharf is disabled (wharf.enabled: false).
  • oauth: The CrewAI OAuth integration service is reachable, and CREWAI_OAUTH_API_KEY is present on this pod. Also emits a non-reversible API-key fingerprint so operators can confirm the key is consistent across all pods. Skipped when CrewAI OAuth is not configured.
  • llm_connections: Each configured LLM connection has its required credentials (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY) set, and each unique provider’s public API is reachable. Providers with tenant-specific base URLs (Azure, Bedrock, Snowflake Cortex, Vertex, custom OpenAI-compatible) are validated for credentials only. Returns skipped on a fresh install with no connections yet.
  • platform_pods: Lists every long-lived Deployment shipped by the platform Helm chart (web, worker, oauth, buildkit, wharf, internal-registry) in the install namespace and verifies that readyReplicas equals spec.replicas for each. Catches partial-outage cases where the Service routes to one healthy pod while sibling replicas are CrashLoopBackOff. Requires the chart’s default RBAC; returns skipped outside Kubernetes.
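
Assuming jq is available and $TOKEN holds the debug token (see The Debug Token below), one quick way to summarize every component's verdict in a single call:

curl -s -H "X-Factory-Debug-Token: $TOKEN" \
  https://<your-factory-host>/health/debug \
  | jq '.components | map_values(.status)'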

Calling the Endpoint

/health/debug requires authorization on every request. Unauthorized callers receive 404 Not Found — the endpoint’s existence is deliberately not advertised. This prevents the probe from being used as an unauthenticated amplifier for LLM-provider calls, Kubernetes API calls, and in-cluster service probes.

For lightweight anonymous health checks (load-balancer probes, Kubernetes readiness probes, external uptime monitors), use GET /health instead — it performs only a cheap database ping.
Two caller identities are accepted:
  • Programmatic access — present the FACTORY_DEBUG_TOKEN via the X-Factory-Debug-Token header. The token is auto-generated and stored in the platform Secret; see The Debug Token below for how to retrieve it.
  • Interactive access — sign in with the factory-admin role and navigate to /admin/factory_health, which calls the same probe.
For example:
TOKEN=$(kubectl get secret <release-name>-secrets -n <namespace> \
  -o jsonpath='{.data.FACTORY_DEBUG_TOKEN}' | base64 -d)

curl -H "X-Factory-Debug-Token: $TOKEN" \
  https://<your-factory-host>/health/debug

A sample response from a degraded deployment:
{
  "status": "degraded",
  "duration_ms": 482,
  "environment": {
    "factory": true,
    "auth_provider": "okta",
    "version": "2026.04.22-abc123"
  },
  "components": {
    "database": { "status": "ok", "latency_ms": 12, "checks": ["select_1"] },
    "studio": {
      "status": "ok",
      "latency_ms": 78,
      "checks": [
        "project_table_readable",
        "studio_v2_route_mounted",
        "agent_install_complete",
        "runner_install_complete",
        "crew_pods_ready"
      ],
      "details": {
        "internal_organization": "CrewAI",
        "internal_deployments": [
          { "id": 100, "slug": "studio-v2-assistant", "status": "Crew is Online", "ready": true },
          { "id": 101, "slug": "studio-v2-runner",    "status": "Crew is Online", "ready": true }
        ],
        "crew_pods_status": "ok",
        "crew_pods": {
          "namespaces": ["crewai-crews"],
          "deployment_count": 2,
          "ready_count": 2,
          "deployments": [
            { "slug": "studio-v2-assistant", "app_id": "org-42-crew-100-studio-v2-assistant", "namespace": "crewai-crews", "healthy": true },
            { "slug": "studio-v2-runner",    "app_id": "org-42-crew-101-studio-v2-runner",    "namespace": "crewai-crews", "healthy": true }
          ]
        }
      }
    },
    "auth":  { "status": "ok",   "latency_ms": 210, "provider": "okta", "http_status": 200 },
    "tools": { "status": "fail", "latency_ms": 3,   "error": "public tool repository not provisioned" },
    "oauth": {
      "status": "ok",
      "latency_ms": 95,
      "details": {
        "host": "oauth.example.com",
        "api_key_configured": true,
        "api_key_fingerprint": "3c1a9e0f4b22",
        "pod": "crewai-web-5d6c7f-abcd1"
      }
    }
  }
}

Interpreting the Verdict

The top-level status maps to a meaning and an HTTP code:
  • ok (HTTP 200): All components healthy. The deployment is fully operational.
  • degraded (HTTP 200): A non-critical component (Studio, Tools, Wharf, OAuth, LLM connections, platform pods) has failed. The deployment is up but some functionality may be impaired.
  • fail (HTTP 503): A critical component (database or authentication) has failed. The deployment cannot serve users.
A component with status: skipped is not a failure — it means the component is not configured for this deployment (for example, Wharf traces are off, or CrewAI OAuth is intentionally disabled). Skipped components do not affect the top-level verdict.
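
A minimal post-install smoke check built on this mapping (a sketch; assumes curl, jq, and $TOKEN as in Calling the Endpoint):

# ok/degraded return HTTP 200; fail returns 503
code=$(curl -s -o /tmp/health_debug.json -w '%{http_code}' \
  -H "X-Factory-Debug-Token: $TOKEN" \
  https://<your-factory-host>/health/debug)
echo "top-level status: $(jq -r '.status' /tmp/health_debug.json) (HTTP $code)"
[ "$code" = "200" ] || exit 1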

The Debug Token

FACTORY_DEBUG_TOKEN authorizes the X-Factory-Debug-Token header path on /health/debug. It is managed by the chart the same way Rails encryption keys are:
  • Auto-generated on first install using randAlphaNum 64 and stored in the platform Secret.
  • Persisted across helm upgrade via the chart’s lookup function, so every pod across every rollout sees the same value.
  • Overridable by setting secrets.FACTORY_DEBUG_TOKEN in your values file (recommended for ArgoCD users — see ArgoCD Deployment).
See secrets.FACTORY_DEBUG_TOKEN for full details.

Retrieving the Current Token

kubectl get secret <release-name>-secrets -n <namespace> \
  -o jsonpath='{.data.FACTORY_DEBUG_TOKEN}' | base64 -d
Use the returned value as the X-Factory-Debug-Token header:
TOKEN=$(kubectl get secret <release-name>-secrets -n <namespace> \
  -o jsonpath='{.data.FACTORY_DEBUG_TOKEN}' | base64 -d)

curl -H "X-Factory-Debug-Token: $TOKEN" \
  https://<your-factory-host>/health/debug

Rotating the Token

Set a new value in secrets.FACTORY_DEBUG_TOKEN and run helm upgrade. The chart’s secret-change detection triggers a rolling restart of the platform pods, and the new token is picked up automatically with no downtime. To force regeneration of a random value on the next install only, unset the override and delete the key from the existing Secret before upgrading — otherwise lookup will preserve the current value across upgrades.
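
A sketch of the rotation flow (the chart reference, release name, and deployment name are placeholders for your install):

# openssl rand -hex 32 yields 64 hex chars, comparable to the chart's
# auto-generated randAlphaNum 64 value
helm upgrade <release-name> <chart-ref> -n <namespace> --reuse-values \
  --set-string secrets.FACTORY_DEBUG_TOKEN="$(openssl rand -hex 32)"
# Wait for the secret-change rolling restart to finish
kubectl rollout status deploy/<release-name>-web -n <namespace>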

Verifying OAuth API Keys Are Consistent Across Pods

The oauth component emits two fields specifically for multi-pod operators:
  • api_key_fingerprint — the first 12 hex chars of SHA256(CREWAI_OAUTH_API_KEY). Non-reversible. Identical keys produce identical fingerprints; different keys produce different fingerprints.
  • pod — the hostname of the pod that served the request.
To confirm every web pod is running with the same API key, call the endpoint against each pod directly (bypassing the load balancer) and compare fingerprints:
for pod in $(kubectl get pods -n <namespace> -l app.kubernetes.io/component=web -o name); do
  kubectl exec -n <namespace> $pod -- \
    curl -s -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
    http://localhost:3000/health/debug \
    | jq '.components.oauth.details | {pod, api_key_fingerprint}'
done
A mismatch indicates that one or more pods were started with a different CREWAI_OAUTH_API_KEY value — usually because a rolling deploy picked up a stale secret, or because of a mid-rollout environment override. Restart the affected pods after correcting the secret source.
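
To cross-check the emitted fingerprint against a key value you hold, you can reproduce the hash locally (a sketch, assuming the key is hashed as raw bytes with no trailing newline):

# First 12 hex chars of SHA-256, matching api_key_fingerprint above
printf '%s' "$CREWAI_OAUTH_API_KEY" | sha256sum | cut -c1-12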

Troubleshooting

database fails

PostgreSQL is unreachable or credentials are wrong. Check the pod’s DATABASE_URL, NetworkPolicy, and the database’s own readiness.
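
One way to reproduce the check from inside the cluster (a sketch; the deployment name is a placeholder, and pg_isready is assumed to be present in the image):

kubectl exec -n <namespace> deploy/<release-name>-web -- \
  sh -c 'pg_isready -d "$DATABASE_URL"'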

auth fails with ConnectionFailed or TimeoutError

The configured identity provider cannot be reached from inside the cluster. Check egress NetworkPolicy and the provider configuration (OKTA_SITE, AUTH0_DOMAIN, KEYCLOAK_SITE, etc.).
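
To reproduce the probe from inside the cluster, fetch the provider's standard OIDC discovery document from a platform pod (a sketch; assumes curl is available in the image, and the IdP domain is a placeholder):

kubectl exec -n <namespace> deploy/<release-name>-web -- \
  curl -sS -m 5 -o /dev/null -w '%{http_code}\n' \
  "https://<your-idp-domain>/.well-known/openid-configuration"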

auth fails with HTTP 4xx/5xx

The IdP is reachable but its OIDC discovery endpoint is not responding correctly. Verify the authorization server URL and tenant configuration.

studio fails with internal organization not found

The studio:install_internal_organization rake task was never run, or the internal “CrewAI” organization was deleted. Without it the Assistant and Runner crews cannot be installed. See Post-Installation for the install command.

studio fails with agent install incomplete or runner install incomplete

The corresponding studio:agent:install or studio:runner:install task either was never run or has not completed successfully. The details.agent / details.runner payload shows which slugs are expected, which exist, and which are not yet Crew is Online. Re-run the matching install task.

studio fails with studio crew(s) without Ready pods

The Deployment record reports Crew is Online but the corresponding Kubernetes Deployment either does not exist or has zero Ready pods. This is the classic “DB record went stale after the pods were deleted by hand” failure mode. The error message names the namespace each unhealthy crew was expected in — check details.crew_pods.namespaces to see every namespace the probe consulted, and the per-row namespace field in details.crew_pods.deployments to pinpoint a specific crew.
Legacy installs that turned k8s_namespace_isolation on after the fact will see false failures here. The probe reads the feature flag at probe time and assumes every crew lives in the namespace the current flag setting would route it to — it does not inspect each crew’s deployment history. So an install that was provisioned with the flag off (crews in K8S_NAMESPACE) and later flipped the flag on will have its Studio crews reported as missing in <K8S_NAMESPACE>-org-<id>. The k8s_namespace_isolation feature is fresh-install-only by design — see Multi-org Namespace Isolation for the supported rollout path. If you are on a legacy install in this state, the only correct fix is to re-provision the affected crews into the per-org namespaces; turning the flag back off will not restore them either.
To recover from a normal “stale DB record” failure:
kubectl -n <crew-namespace> get deploy -l app-id=<expected-app-id>
If the result is empty, re-run the matching studio:*:install task to re-provision.

studio’s crew_pods_status is skipped

This pod is not running inside Kubernetes, or the crew namespace environment variable is not set. The DB, route, and install verification still ran — only the pod-level cross-check was skipped. Set the crew namespace on the platform pod (typically crewai-crews) and the probe will start querying it.

tools fails with public tool repository not provisioned

The seed migration that creates the default public tool repository did not run. Run the database seed task as part of the install.

wharf fails or is skipped

Skipped is expected when Wharf is disabled (wharf.enabled: false). A ConnectionFailed or TimeoutError means WHARF_URL is set but the service cannot be reached — check egress NetworkPolicy, DNS, and that the Wharf service itself is running.

oauth fails with CREWAI_OAUTH_API_KEY is not configured on this pod

The pod is missing the OAuth API key. Check the pod’s environment and the secret source used to provision it. If this affects only some pods, perform the fingerprint comparison above to identify the affected pods.

oauth is skipped

CREWAI_OAUTH_API_BASE_URL is not set. Expected for Factory deployments that do not use third-party OAuth connectors (Slack, HubSpot, Gmail, etc.).

llm_connections is degraded with missing required credentials

One or more LLM connections exist but are missing the environment variables required for their provider (for example, an OpenAI connection without OPENAI_API_KEY). The failing connection IDs and missing keys appear under details.missing_credentials. Fix by setting the env vars on the connection through the LLM Connections UI, or on the organization’s environment-variable scope.
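
To pull just the failing entries (assuming jq and $TOKEN as in Calling the Endpoint):

curl -s -H "X-Factory-Debug-Token: $TOKEN" \
  https://<your-factory-host>/health/debug \
  | jq '.components.llm_connections.details.missing_credentials'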

llm_connections is degraded with provider(s) unreachable

A provider’s public API cannot be reached from this pod. Check outbound NetworkPolicy and DNS. A 401/403 from the provider counts as reachable — the provider answered. Only timeouts and connection failures count as unreachable.
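
To reproduce the reachability test from a platform pod (a sketch; api.openai.com stands in for whichever provider is failing):

# Any HTTP answer, even 401/403, proves reachability; only timeouts and
# connection errors count as unreachable
kubectl exec -n <namespace> deploy/<release-name>-web -- \
  curl -sS -m 5 -o /dev/null -w '%{http_code}\n' https://api.openai.com/v1/models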

llm_connections is skipped

No LLM connections exist yet. Expected on a fresh install or before any organization has configured an LLM.

platform_pods is degraded with N deployment(s) not Ready

One or more platform Deployments have fewer Ready replicas than desired. The offending names appear in the error message; the full per-deployment table is under details.deployments. Common causes:
  • Image pull failures — kubectl describe deploy <name> -n <namespace>
  • Insufficient resources on the node
  • A recently failed rollout — see details.deployments[*].condition
The chart never ships components at zero replicas, so a desired_replicas: 0 row also fails this check (somebody scaled it down by hand).
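
To see ready-versus-desired counts for every platform Deployment at a glance:

kubectl get deploy -n <namespace> \
  -o custom-columns='NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas'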

platform_pods fails with kubernetes api error: HTTP 403

The platform service account cannot list Deployments in its own namespace. This typically means the chart’s RBAC was disabled (rbac.create=false) or the platform Role was modified. Re-enable the bundled RBAC or grant list, get on apps/v1 Deployments in the install namespace.
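
You can confirm the permission directly (the service account name is a placeholder; read it off the web pod's spec.serviceAccountName):

kubectl auth can-i list deployments.apps -n <namespace> \
  --as="system:serviceaccount:<namespace>:<platform-service-account>"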

platform_pods is skipped

The probe is not running inside a Kubernetes pod. Expected in dev or test environments — and on those topologies the entire /health/debug endpoint is also disabled (see Availability).

Operational Notes

  • All checks run in parallel with a 10-second overall budget and a 3-second per-check cap.
  • No check performs writes; the probe is safe to run at any time.
  • The endpoint is registered only when CREWAI_FACTORY=true. SaaS deployments return 404.
  • For continuous liveness probes (e.g. Kubernetes readiness checks), use GET /health instead.
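
For reference, the anonymous endpoint needs no token and answers quickly:

curl -fsS -m 2 https://<your-factory-host>/health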