Overview
GET /health/debug is a comprehensive diagnostic endpoint exposed by Kubernetes-based Factory deployments (see Availability below). Run it once after install and after any configuration change to confirm that the underlying components — database, authentication, Studio, tools, traces, OAuth, LLM connections, and platform pods — are all reachable and responsive.
It is not designed for continuous polling. Use the lightweight GET /health endpoint (a cheap database ping) for liveness probes.
Available from Helm chart v0.6.11 onwards. Earlier installs do not expose /health/debug or the admin tab; upgrade the chart first.
An equivalent admin UI is available at /admin/factory_health for users with the factory-admin role. It renders the same data with status pills, latency badges, and per-deployment tables.
Availability
The /health/debug endpoint and matching admin page are only enabled on Kubernetes-based Factory deployments. Several of the underlying probes (in-cluster Service URLs, Deployment readiness checks, pod-level fingerprint comparison) only make sense on Kubernetes and would emit confidently wrong results on other provisioners.
| Provider | Endpoint enabled |
|---|---|
| Kubernetes (KOTS, plain Helm) | Yes |
| EKS (Helm-managed) | Yes |
| AWS ECS / Fargate | No (returns 404) |
| Local development | No |
On unsupported providers the endpoint returns 404 Not Found and the admin tab redirects to the dashboard with an explanatory message. This is intentional — silent disablement is preferable to a green status pill that does not reflect the deployment topology.
What It Checks
| Component | What it verifies |
|---|---|
| database | Primary PostgreSQL connection is reachable and responsive. |
| studio | The Studio subsystem is healthy end-to-end across four layers: (1) core tables (Studio::Project, Studio::ProjectVersion, Studio::Execution) are queryable; (2) the Studio v2 HTTP surface (/studio/v2) is mounted in the router; (3) the internal “CrewAI” organization exists with both Assistant and Runner deployment records, all reporting Crew is Online; (4) the per-crew Kubernetes Deployments backing those records exist with at least one Ready pod each. Namespace resolution mirrors what the provisioner would do today: K8S_NAMESPACE when the k8s_namespace_isolation feature flag is off, <K8S_NAMESPACE>-org-<id> when it is on. See the legacy-install caveat in the studio crew(s) without Ready pods troubleshooting section below. |
| auth | The configured identity provider (Okta, Auth0, Entra ID, Keycloak, WorkOS, or local) is reachable via its OIDC discovery endpoint. No real login is attempted. |
| tools | The internal tool repository is provisioned and the PyPI-compatible tools index route is mounted (used by deployed crews to resolve tool packages). |
| wharf | The Wharf OTEL/traces backend is reachable from this pod. Skipped when Wharf is disabled (wharf.enabled: false). |
| oauth | The CrewAI OAuth integration service is reachable, and CREWAI_OAUTH_API_KEY is present on this pod. Also emits a non-reversible API-key fingerprint so operators can confirm the key is consistent across all pods. Skipped when CrewAI OAuth is not configured. |
| llm_connections | Each configured LLM connection has its required credentials (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY) set, and each unique provider’s public API is reachable. Providers with tenant-specific base URLs (Azure, Bedrock, Snowflake Cortex, Vertex, custom OpenAI-compatible) are validated for credentials only. Returns skipped on a fresh install with no connections yet. |
| platform_pods | Lists every long-lived Deployment shipped by the platform Helm chart (web, worker, oauth, buildkit, wharf, internal-registry) in the install namespace and verifies that readyReplicas equals spec.replicas for each. Catches partial-outage cases where the Service routes to one healthy pod while sibling replicas are CrashLoopBackOff. Requires the chart’s default RBAC; returns skipped outside Kubernetes. |
Calling the Endpoint
Two caller identities are accepted:
- Programmatic access — present the FACTORY_DEBUG_TOKEN via the X-Factory-Debug-Token header, as in the example below. The token is auto-generated and stored in the platform Secret; see The Debug Token below for how to retrieve it.
- Interactive access — sign in with the factory-admin role and navigate to /admin/factory_health, which calls the same probe.
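For example, a minimal programmatic call (the hostname is a placeholder, and FACTORY_DEBUG_TOKEN is assumed to be exported in your shell; see The Debug Token below):

```bash
# Minimal programmatic call; factory.example.com is a placeholder
# for your Factory URL.
curl -sS \
  -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
  https://factory.example.com/health/debug | jq .
```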
Interpreting the Verdict
| Top-level status | Meaning | HTTP code |
|---|---|---|
| ok | All components healthy. The deployment is fully operational. | 200 |
| degraded | A non-critical component (Studio, Tools, Wharf, OAuth, LLM connections, platform pods) has failed. The deployment is up but some functionality may be impaired. | 200 |
| fail | A critical component (database or authentication) has failed. The deployment cannot serve users. | 503 |
status: skipped is not a failure — it means the component is not configured for this deployment (for example, Wharf traces are off, or CrewAI OAuth is intentionally disabled). Skipped components do not affect the top-level verdict.
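A minimal post-install smoke check built on the documented verdicts and HTTP codes; a sketch, assuming the top-level verdict is returned in a status field and that jq is available:

```bash
#!/usr/bin/env bash
# Fail a post-install smoke-test step unless the probe reports "ok".
# Assumes the top-level verdict is exposed as a "status" field;
# the hostname is a placeholder.
set -euo pipefail

body=$(curl -sS -w '\n%{http_code}' \
  -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
  https://factory.example.com/health/debug)

code=${body##*$'\n'}   # trailing HTTP status code
json=${body%$'\n'*}    # everything before it

verdict=$(jq -r '.status' <<<"$json")
echo "HTTP $code, verdict: $verdict"

# 503 (fail), or any non-ok verdict, should stop the pipeline.
[[ "$code" == "200" && "$verdict" == "ok" ]]
```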
The Debug Token
FACTORY_DEBUG_TOKEN authorizes the X-Factory-Debug-Token header path on /health/debug. It is managed by the chart the same way Rails encryption keys are:
- Auto-generated on first install using randAlphaNum 64 and stored in the platform Secret.
- Persisted across helm upgrade via the chart’s lookup function, so every pod across every rollout sees the same value.
- Overridable by setting secrets.FACTORY_DEBUG_TOKEN in your values file (recommended for ArgoCD users — see ArgoCD Deployment).

See secrets.FACTORY_DEBUG_TOKEN for full details.
Retrieving the Current Token
Read the token from the platform Secret and present it in the X-Factory-Debug-Token header.
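A minimal sketch, assuming the Secret is named crewai-platform in namespace crewai (both hypothetical; list the Secrets in your install namespace to find the real names):

```bash
# Read the auto-generated token out of the platform Secret.
# Secret name and namespace below are hypothetical placeholders.
FACTORY_DEBUG_TOKEN=$(kubectl get secret crewai-platform -n crewai \
  -o jsonpath='{.data.FACTORY_DEBUG_TOKEN}' | base64 -d)

curl -sS -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
  https://factory.example.com/health/debug
```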
Rotating the Token
Set a new value in secrets.FACTORY_DEBUG_TOKEN and run helm upgrade. The chart’s secret-change detection triggers a rolling restart of the platform pods, and the new token is picked up automatically with no downtime. To force regeneration of a random value on the next install only, unset the override and delete the key from the existing Secret before upgrading — otherwise lookup will preserve the current value across upgrades.
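For example (release and chart names are placeholders, and openssl rand -hex 32 stands in for any 64-character generator):

```bash
# Rotate to a fresh 64-character token; the chart's secret-change
# detection rolls the platform pods automatically.
NEW_TOKEN=$(openssl rand -hex 32)   # 64 hex chars

helm upgrade crewai crewai/crewai-factory \
  --reuse-values \
  --set secrets.FACTORY_DEBUG_TOKEN="$NEW_TOKEN"
```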
Verifying OAuth API Keys Are Consistent Across Pods
The oauth component emits two fields specifically for multi-pod operators:
- api_key_fingerprint — the first 12 hex chars of SHA256(CREWAI_OAUTH_API_KEY). Non-reversible. Identical keys produce identical fingerprints; different keys produce different fingerprints.
- pod — the hostname of the pod that served the request.
If different pods report different fingerprints, at least one pod is running with a different CREWAI_OAUTH_API_KEY value — usually because a rolling deploy picked up a stale secret, or because of a mid-rollout environment override. Restart the affected pods after correcting the secret source.
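To compare fingerprints across replicas, one approach is to call the probe repeatedly through the Service and de-duplicate the pod/fingerprint pairs. The .components.oauth response path below is an assumption; adjust it to the actual payload shape:

```bash
# Repeated calls are load-balanced across pods, so this samples
# several replicas. Healthy output: one line per pod, all sharing
# the same fingerprint. (.components.oauth is an assumed JSON path.)
for _ in $(seq 1 20); do
  curl -sS -H "X-Factory-Debug-Token: $FACTORY_DEBUG_TOKEN" \
    https://factory.example.com/health/debug |
    jq -r '.components.oauth | "\(.pod) \(.api_key_fingerprint)"'
done | sort -u
```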
Troubleshooting
database fails
PostgreSQL is unreachable or credentials are wrong. Check the pod’s DATABASE_URL, NetworkPolicy, and the database’s own readiness.
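To narrow it down, inspect what a platform pod actually sees. A sketch, assuming the web Deployment from the platform_pods table, a hypothetical crewai namespace, and an image that ships PostgreSQL client tools:

```bash
# Confirm the connection string the pod was started with
# (redact credentials before sharing the output).
kubectl exec deploy/web -n crewai -- sh -c 'echo "$DATABASE_URL"'

# Probe the database's own readiness from inside the cluster.
kubectl exec deploy/web -n crewai -- sh -c 'pg_isready -d "$DATABASE_URL"'
```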
auth fails with ConnectionFailed or TimeoutError
The configured identity provider cannot be reached from inside the cluster. Check egress NetworkPolicy and the provider configuration (OKTA_SITE, AUTH0_DOMAIN, KEYCLOAK_SITE, etc.).
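You can reproduce the probe's view by fetching the OIDC discovery document from inside a platform pod. A sketch; substitute your provider's host for the Okta placeholder:

```bash
# Fetch the OIDC discovery document the probe checks.
# Replace the host with your OKTA_SITE / AUTH0_DOMAIN / KEYCLOAK_SITE.
kubectl exec deploy/web -n crewai -- \
  curl -sS --max-time 5 \
  https://your-org.okta.com/.well-known/openid-configuration
```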
auth fails with HTTP 4xx/5xx
The IdP is reachable but its OIDC discovery endpoint is not responding correctly. Verify the authorization server URL and tenant configuration.
studio fails with internal organization not found
The studio:install_internal_organization rake task was never run, or the internal “CrewAI” organization was deleted. Without it the Assistant and Runner crews cannot be installed. See Post-Installation for the install command.
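One way to run it is from inside a platform pod. A sketch, assuming a Rails-style bundle exec rake entry point and a hypothetical crewai namespace:

```bash
# Re-create the internal "CrewAI" organization.
kubectl exec deploy/web -n crewai -- \
  bundle exec rake studio:install_internal_organization
```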
studio fails with agent install incomplete or runner install incomplete
The corresponding studio:agent:install or studio:runner:install task either was never run or has not completed successfully. The details.agent / details.runner payload shows which slugs are expected, which exist, and which are not yet Crew is Online. Re-run the matching install task.
studio fails with studio crew(s) without Ready pods
The Deployment record reports Crew is Online but the corresponding Kubernetes Deployment either does not exist or has zero Ready pods. This is the classic “DB record went stale after the pods were deleted by hand” failure mode. The error message names the namespace each unhealthy crew was expected in — check details.crew_pods.namespaces to see every namespace the probe consulted, and the per-row namespace field in details.crew_pods.deployments to pinpoint a specific crew.
To recover from a normal “stale DB record” failure, re-run the matching studio:*:install task to re-provision.
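Before re-running the install task, it can help to confirm what the probe saw. A sketch using the typical crewai-crews namespace and a hypothetical per-org namespace:

```bash
# Compare actual Ready pod counts against what the DB records claim.
kubectl get deploy -n crewai-crews

# With k8s_namespace_isolation on, also check the per-org namespace,
# <K8S_NAMESPACE>-org-<id>; the org id 42 here is hypothetical.
kubectl get deploy -n crewai-org-42
```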
studio’s crew_pods_status is skipped
This pod is not running inside Kubernetes, or the crew namespace environment variable is not set. The DB, route, and install verification still ran — only the pod-level cross-check was skipped. Set the crew namespace on the platform pod (typically crewai-crews) and the probe will start querying it.
tools fails with public tool repository not provisioned
The seed migration that creates the default public tool repository did not run. Run the database seed task as part of the install.
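If you need to run it by hand, something like the following; rake db:seed is the conventional Rails task name and an assumption here, as your install guide may name a different task:

```bash
# Re-run the seeds that provision the default public tool repository.
kubectl exec deploy/web -n crewai -- bundle exec rake db:seed
```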
wharf fails or is skipped
Skipped is expected when Wharf is disabled (wharf.enabled: false). A ConnectionFailed or TimeoutError means WHARF_URL is set but the service cannot be reached — check egress NetworkPolicy, DNS, and that the Wharf service itself is running.
oauth fails with CREWAI_OAUTH_API_KEY is not configured on this pod
The pod is missing the OAuth API key. Check the pod’s environment and the secret source used to provision it. If this affects only some pods, perform the fingerprint comparison above to identify the affected pods.
oauth is skipped
CREWAI_OAUTH_API_BASE_URL is not set. Expected for Factory deployments that do not use third-party OAuth connectors (Slack, HubSpot, Gmail, etc.).
llm_connections is degraded with missing required credentials
One or more LLM connections exist but are missing the environment variables required for their provider (for example, an OpenAI connection without OPENAI_API_KEY). The failing connection IDs and missing keys appear under details.missing_credentials. Fix by setting the env vars on the connection through the LLM Connections UI, or on the organization’s environment-variable scope.
llm_connections is degraded with provider(s) unreachable
A provider’s public API cannot be reached from this pod. Check outbound NetworkPolicy and DNS. A 401/403 from the provider counts as reachable — the provider answered. Only timeouts and connection failures count as unreachable.
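You can reproduce the reachability semantics with a status-code-only request from a platform pod; api.openai.com is shown as an example provider host:

```bash
# Any HTTP status (even 401/403) proves the network path is open;
# only a timeout or connection error counts as unreachable.
kubectl exec deploy/web -n crewai -- \
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 \
  https://api.openai.com/v1/models
```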
llm_connections is skipped
No LLM connections exist yet. Expected on a fresh install or before any organization has configured an LLM.
platform_pods is degraded with N deployment(s) not Ready
One or more platform Deployments has fewer Ready replicas than desired. The offending names appear in the error message; the full per-deployment table is under details.deployments. Common causes:
- Image pull failures — run kubectl describe deploy <name> -n <namespace>
- Insufficient resources on the node
- A recently failed rollout — see details.deployments[*].condition

A desired_replicas: 0 row also fails this check (somebody scaled it down by hand).
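A quick triage pass over the same Deployments the probe inspects (the namespace is a placeholder):

```bash
# READY shows readyReplicas/spec.replicas, mirroring the probe's check.
kubectl get deploy -n crewai \
  web worker oauth buildkit wharf internal-registry

# Follow a stuck rollout in detail.
kubectl rollout status deploy/worker -n crewai
```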
platform_pods fails with kubernetes api error: HTTP 403
The platform service account cannot list Deployments in its own namespace. This typically means the chart’s RBAC was disabled (rbac.create=false) or the platform Role was modified. Re-enable the bundled RBAC or grant list, get on apps/v1 Deployments in the install namespace.
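If re-enabling the bundled RBAC is not an option, a minimal manual grant looks roughly like this; the Role, namespace, and ServiceAccount names are assumptions, so match them to your release:

```bash
# Grant list/get on apps/v1 Deployments in the install namespace.
kubectl create role factory-health-reader -n crewai \
  --verb=list --verb=get --resource=deployments.apps

kubectl create rolebinding factory-health-reader -n crewai \
  --role=factory-health-reader \
  --serviceaccount=crewai:crewai-platform
```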
platform_pods is skipped
The probe is not running inside a Kubernetes pod. Expected in dev or test environments — and on those topologies the entire /health/debug endpoint is also disabled (see Availability).
Operational Notes
- All checks run in parallel with a 10-second overall budget and a 3-second per-check cap.
- No check performs writes; the probe is safe to run at any time.
- The endpoint is registered only when CREWAI_FACTORY=true. SaaS deployments return 404.
- For continuous liveness probes (e.g. Kubernetes readiness checks), use GET /health instead.
