Overview
GET /health/debug is a comprehensive diagnostic endpoint exposed by Kubernetes-based Factory deployments (see Availability below). Run it once after install and after any configuration change to confirm that the underlying components — database, authentication, Studio, tools, traces, OAuth, LLM connections, and platform pods — are all reachable and responsive.
It is not designed for continuous polling. Use the lightweight GET /health endpoint (a cheap database ping) for liveness probes.
Available from Helm chart v0.6.11 onwards. Earlier installs do not expose /health/debug or the admin tab; upgrade the chart first.
An equivalent admin UI is available at /admin/factory_health for users with the factory-admin role. It renders the same data with status pills, latency badges, and per-deployment tables.
Availability
The /health/debug endpoint and matching admin page are only enabled on Kubernetes-based Factory deployments. Several of the underlying probes (in-cluster Service URLs, Deployment readiness checks, pod-level fingerprint comparison) only make sense on Kubernetes and would emit confidently wrong results on other provisioners.
| Provider | Endpoint enabled |
|---|---|
| Kubernetes (KOTS, plain Helm) | Yes |
| EKS (Helm-managed) | Yes |
| AWS ECS / Fargate | No (returns 404) |
| Local development | No |
On unsupported providers the endpoint returns 404 Not Found and the admin tab redirects to the dashboard with an explanatory message. This is intentional — silent disablement is preferable to a green status pill that does not reflect the deployment topology.
What It Checks
| Component | What it verifies |
|---|---|
| database | Primary PostgreSQL connection is reachable and responsive. |
| studio | The Studio subsystem is healthy end-to-end across four layers: (1) core tables (Studio::Project, Studio::ProjectVersion, Studio::Execution) are queryable; (2) the Studio v2 HTTP surface (/studio/v2) is mounted in the router; (3) the internal “CrewAI” organization exists with both Assistant and Runner deployment records, all reporting Crew is Online; (4) the per-crew Kubernetes Deployments backing those records exist with at least one Ready pod each. Namespace resolution mirrors what the provisioner would do today: K8S_NAMESPACE when the k8s_namespace_isolation feature flag is off, <K8S_NAMESPACE>-org-<id> when it is on. See the legacy-install caveat in the studio crew(s) without Ready pods troubleshooting section below. |
| auth | The configured identity provider (Okta, Auth0, Entra ID, Keycloak, WorkOS, or local) is reachable via its OIDC discovery endpoint. No real login is attempted. |
| tools | The internal tool repository is provisioned and the PyPI-compatible tools index route is mounted (used by deployed crews to resolve tool packages). |
| wharf | The Wharf OTEL/traces backend is reachable from this pod. Skipped when Wharf is disabled (wharf.enabled: false). |
| oauth | The CrewAI OAuth integration service is reachable, and CREWAI_OAUTH_API_KEY is present on this pod. Also emits a non-reversible API-key fingerprint so operators can confirm the key is consistent across all pods. Skipped when CrewAI OAuth is not configured. |
| llm_connections | Each configured LLM connection has its required credentials (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY) set, and each unique provider’s public API is reachable. Providers with tenant-specific base URLs (Azure, Bedrock, Snowflake Cortex, Vertex, custom OpenAI-compatible) are validated for credentials only. Returns skipped on a fresh install with no connections yet. |
| platform_pods | Lists every long-lived Deployment shipped by the platform Helm chart (web, worker, oauth, buildkit, wharf, internal-registry) in the install namespace and verifies that readyReplicas equals spec.replicas for each. Catches partial-outage cases where the Service routes to one healthy pod while sibling replicas are CrashLoopBackOff. Requires the chart’s default RBAC; returns skipped outside Kubernetes. |
Calling the Endpoint
Two caller identities are accepted:
- Programmatic access — present the FACTORY_DEBUG_TOKEN via the X-Factory-Debug-Token header, as in the example below. The token is auto-generated and stored in the platform Secret; see The Debug Token below for how to retrieve it.
- Interactive access — sign in with the factory-admin role and navigate to /admin/factory_health, which calls the same probe.
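For example, a minimal programmatic call (the hostname is a placeholder, and FACTORY_DEBUG_TOKEN is assumed to be exported in your shell; see The Debug Token below):

```bash
# Minimal programmatic call; factory.example.com is a placeholder
# for your Factory URL.
curl -sS \
  -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
  https://factory.example.com/health/debug | jq .
```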
Interpreting the Verdict
| Top-level status | Meaning | HTTP code |
|---|---|---|
| ok | All components healthy. The deployment is fully operational. | 200 |
| degraded | A non-critical component (Studio, Tools, Wharf, OAuth, LLM connections, platform pods) has failed. The deployment is up but some functionality may be impaired. | 200 |
| fail | A critical component (database or authentication) has failed. The deployment cannot serve users. | 503 |
status: skipped is not a failure — it means the component is not configured for this deployment (for example, Wharf traces are off, or CrewAI OAuth is intentionally disabled). Skipped components do not affect the top-level verdict.
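A minimal post-install smoke check built on the documented verdicts and HTTP codes; a sketch, assuming the top-level verdict is returned in a status field and that jq is available:

```bash
#!/usr/bin/env bash
# Fail a post-install smoke-test step unless the probe reports "ok".
# Assumes the top-level verdict is exposed as a "status" field;
# the hostname is a placeholder.
set -euo pipefail

body=$(curl -sS -w '\n%{http_code}' \
  -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
  https://factory.example.com/health/debug)

code=${body##*$'\n'}   # trailing HTTP status code
json=${body%$'\n'*}    # everything before it

verdict=$(jq -r '.status' <<<"$json")
echo "HTTP $code, verdict: $verdict"

# 503 (fail), or any non-ok verdict, should stop the pipeline.
[[ "$code" == "200" && "$verdict" == "ok" ]]
```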
The Debug Token
FACTORY_DEBUG_TOKEN authorizes the X-Factory-Debug-Token header path on /health/debug. It is managed by the chart the same way Rails encryption keys are:
- Auto-generated on first install using randAlphaNum 64 and stored in the platform Secret.
- Persisted across helm upgrade via the chart’s lookup function, so every pod across every rollout sees the same value.
- Overridable by setting secrets.FACTORY_DEBUG_TOKEN in your values file (recommended for ArgoCD users — see ArgoCD Deployment).

See secrets.FACTORY_DEBUG_TOKEN for full details.
Retrieving the Current Token
Read the token from the platform Secret and present it in the X-Factory-Debug-Token header.
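A minimal sketch, assuming the Secret is named crewai-platform in namespace crewai (both hypothetical; list the Secrets in your install namespace to find the real names):

```bash
# Read the auto-generated token out of the platform Secret.
# Secret name and namespace below are hypothetical placeholders.
FACTORY_DEBUG_TOKEN=$(kubectl get secret crewai-platform -n crewai \
  -o jsonpath='{.data.FACTORY_DEBUG_TOKEN}' | base64 -d)

curl -sS -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
  https://factory.example.com/health/debug
```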
Rotating the Token
Set a new value in secrets.FACTORY_DEBUG_TOKEN and run helm upgrade. The chart’s secret-change detection triggers a rolling restart of the platform pods, and the new token is picked up automatically with no downtime. To force regeneration of a random value on the next install only, unset the override and delete the key from the existing Secret before upgrading — otherwise lookup will preserve the current value across upgrades.
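For example (release and chart names are placeholders, and openssl rand -hex 32 stands in for any 64-character generator):

```bash
# Rotate to a fresh 64-character token; the chart's secret-change
# detection rolls the platform pods automatically.
NEW_TOKEN=$(openssl rand -hex 32)   # 64 hex chars

helm upgrade crewai crewai/crewai-factory \
  --reuse-values \
  --set secrets.FACTORY_DEBUG_TOKEN="$NEW_TOKEN"
```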
Verifying OAuth API Keys Are Consistent Across Pods
The oauth component emits two fields specifically for multi-pod operators:
- api_key_fingerprint — the first 12 hex chars of SHA256(CREWAI_OAUTH_API_KEY). Non-reversible. Identical keys produce identical fingerprints; different keys produce different fingerprints.
- pod — the hostname of the pod that served the request.
If different pods report different fingerprints, at least one pod is running with a different CREWAI_OAUTH_API_KEY value — usually because a rolling deploy picked up a stale secret, or because of a mid-rollout environment override. Restart the affected pods after correcting the secret source.
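To compare fingerprints across replicas, one approach is to call the probe repeatedly through the Service and de-duplicate the pod/fingerprint pairs. The .components.oauth response path below is an assumption; adjust it to the actual payload shape:

```bash
# Repeated calls are load-balanced across pods, so this samples
# several replicas. Healthy output: one line per pod, all sharing
# the same fingerprint. (.components.oauth is an assumed JSON path.)
for _ in $(seq 1 20); do
  curl -sS -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
    https://factory.example.com/health/debug |
    jq -r '.components.oauth | "\(.pod) \(.api_key_fingerprint)"'
done | sort -u
```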
Troubleshooting
database fails
PostgreSQL is unreachable or credentials are wrong. Check the pod’s DATABASE_URL, NetworkPolicy, and the database’s own readiness.
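To narrow it down, inspect what a platform pod actually sees. A sketch, assuming the web Deployment from the platform_pods table, a hypothetical crewai namespace, and an image that ships PostgreSQL client tools:

```bash
# Confirm the connection string the pod was started with
# (redact credentials before sharing the output).
kubectl exec deploy/web -n crewai -- sh -c 'echo "$DATABASE_URL"'

# Probe the database's own readiness from inside the cluster.
kubectl exec deploy/web -n crewai -- sh -c 'pg_isready -d "$DATABASE_URL"'
```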
auth fails with ConnectionFailed or TimeoutError
The configured identity provider cannot be reached from inside the cluster. Check egress NetworkPolicy and the provider configuration (OKTA_SITE, AUTH0_DOMAIN, KEYCLOAK_SITE, etc.).
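You can reproduce the probe's view by fetching the OIDC discovery document from inside a platform pod. A sketch; substitute your provider's host for the Okta placeholder:

```bash
# Fetch the OIDC discovery document the probe checks.
# Replace the host with your OKTA_SITE / AUTH0_DOMAIN / KEYCLOAK_SITE.
kubectl exec deploy/web -n crewai -- \
  curl -sS --max-time 5 \
  https://your-org.okta.com/.well-known/openid-configuration
```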
auth fails with HTTP 4xx/5xx
The IdP is reachable but its OIDC discovery endpoint is not responding correctly. Verify the authorization server URL and tenant configuration.
studio fails with internal organization not found
The studio:install_internal_organization rake task was never run, or the internal “CrewAI” organization was deleted. Without it the Assistant and Runner crews cannot be installed. See Post-Installation for the install command.
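One way to run it is from inside a platform pod. A sketch, assuming a Rails-style bundle exec rake entry point and a hypothetical crewai namespace:

```bash
# Re-create the internal "CrewAI" organization.
kubectl exec deploy/web -n crewai -- \
  bundle exec rake studio:install_internal_organization
```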
studio fails with agent install incomplete or runner install incomplete
The corresponding studio:agent:install or studio:runner:install task either was never run or has not completed successfully. The details.agent / details.runner payload shows which slugs are expected, which exist, and which are not yet Crew is Online. Re-run the matching install task.
studio fails with studio crew(s) without Ready pods
The Deployment record reports Crew is Online but the corresponding Kubernetes Deployment either does not exist or has zero Ready pods. This is the classic “DB record went stale after the pods were deleted by hand” failure mode. The error message names the namespace each unhealthy crew was expected in — check details.crew_pods.namespaces to see every namespace the probe consulted, and the per-row namespace field in details.crew_pods.deployments to pinpoint a specific crew.
To recover from a normal “stale DB record” failure, re-run the matching studio:*:install task to re-provision.
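Before re-running the install task, it can help to confirm what the probe saw. A sketch using the typical crewai-crews namespace and a hypothetical per-org namespace:

```bash
# Compare actual Ready pod counts against what the DB records claim.
kubectl get deploy -n crewai-crews

# With k8s_namespace_isolation on, also check the per-org namespace,
# <K8S_NAMESPACE>-org-<id>; the org id 42 here is hypothetical.
kubectl get deploy -n crewai-org-42
```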
studio’s crew_pods_status is skipped
This pod is not running inside Kubernetes, or the crew namespace environment variable is not set. The DB, route, and install verification still ran — only the pod-level cross-check was skipped. Set the crew namespace on the platform pod (typically crewai-crews) and the probe will start querying it.
tools fails with public tool repository not provisioned
The seed migration that creates the default public tool repository did not run. Run the database seed task as part of the install.
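If you need to run it by hand, something like the following; rake db:seed is the conventional Rails task name and an assumption here, as your install guide may name a different task:

```bash
# Re-run the seeds that provision the default public tool repository.
kubectl exec deploy/web -n crewai -- bundle exec rake db:seed
```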
wharf fails or is skipped
Skipped is expected when Wharf is disabled (wharf.enabled: false). A ConnectionFailed or TimeoutError means WHARF_URL is set but the service cannot be reached — check egress NetworkPolicy, DNS, and that the Wharf service itself is running.
oauth fails with CREWAI_OAUTH_API_KEY is not configured on this pod
The pod is missing the OAuth API key. Check the pod’s environment and the secret source used to provision it. If this affects only some pods, perform the fingerprint comparison above to identify the affected pods.
oauth is skipped
CREWAI_OAUTH_API_BASE_URL is not set. Expected for Factory deployments that do not use third-party OAuth connectors (Slack, HubSpot, Gmail, etc.).
llm_connections is degraded with missing required credentials
One or more LLM connections exist but are missing the environment variables required for their provider (for example, an OpenAI connection without OPENAI_API_KEY). The failing connection IDs and missing keys appear under details.missing_credentials. Fix by setting the env vars on the connection through the LLM Connections UI, or on the organization’s environment-variable scope.
llm_connections is degraded with provider(s) unreachable
A provider’s public API cannot be reached from this pod. Check outbound NetworkPolicy and DNS. A 401/403 from the provider counts as reachable — the provider answered. Only timeouts and connection failures count as unreachable.
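You can reproduce the reachability semantics with a status-code-only request from a platform pod; api.openai.com is shown as an example provider host:

```bash
# Any HTTP status (even 401/403) proves the network path is open;
# only a timeout or connection error counts as unreachable.
kubectl exec deploy/web -n crewai -- \
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 \
  https://api.openai.com/v1/models
```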
llm_connections is skipped
No LLM connections exist yet. Expected on a fresh install or before any organization has configured an LLM.
platform_pods is degraded with N deployment(s) not Ready
One or more platform Deployments has fewer Ready replicas than desired. The offending names appear in the error message; the full per-deployment table is under details.deployments. Common causes:
- Image pull failures — run kubectl describe deploy <name> -n <namespace>
- Insufficient resources on the node
- A recently failed rollout — see details.deployments[*].condition

A desired_replicas: 0 row also fails this check (somebody scaled it down by hand).
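A quick triage pass over the same Deployments the probe inspects (the namespace is a placeholder):

```bash
# READY shows readyReplicas/spec.replicas, mirroring the probe's check.
kubectl get deploy -n crewai \
  web worker oauth buildkit wharf internal-registry

# Follow a stuck rollout in detail.
kubectl rollout status deploy/worker -n crewai
```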
platform_pods fails with kubernetes api error: HTTP 403
The platform service account cannot list Deployments in its own namespace. This typically means the chart’s RBAC was disabled (rbac.create=false) or the platform Role was modified. Re-enable the bundled RBAC or grant list, get on apps/v1 Deployments in the install namespace.
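If re-enabling the bundled RBAC is not an option, a minimal manual grant looks roughly like this; the Role, namespace, and ServiceAccount names are assumptions, so match them to your release:

```bash
# Grant list/get on apps/v1 Deployments in the install namespace.
kubectl create role factory-health-reader -n crewai \
  --verb=list --verb=get --resource=deployments.apps

kubectl create rolebinding factory-health-reader -n crewai \
  --role=factory-health-reader \
  --serviceaccount=crewai:crewai-platform
```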
platform_pods is skipped
The probe is not running inside a Kubernetes pod. Expected in dev or test environments — and on those topologies the entire /health/debug endpoint is also disabled (see Availability).
Operational Notes
- All checks run in parallel with a 10-second overall budget and a 3-second per-check cap.
- No check performs writes; the probe is safe to run at any time.
- The endpoint is registered only when CREWAI_FACTORY=true. SaaS deployments return 404.
- For continuous liveness probes (e.g. Kubernetes readiness checks), use GET /health instead.
