feat(server): serve health endpoints on separate unauthenticated port #903

Merged
TaylorMutch merged 1 commit into NVIDIA:main from sjenning:feat/separate-health-endpoint
Apr 21, 2026

Conversation

@sjenning
Contributor

Summary

Move /health, /healthz, and /readyz to a dedicated plaintext HTTP port (default 8081) so Kubernetes probes work without mTLS client certificates. Also fixes the Docker cluster healthcheck to avoid sending plaintext bytes into the TLS listener.
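To illustrate what "plaintext HTTP on a dedicated port" means for a probe, here is a minimal sketch using only the Rust standard library. It is not the PR's implementation (the PR spawns `axum::serve` for `health_router`); the names `health_response`, `handle_health_conn`, and `serve_health` are hypothetical, and the `ok` response body is an assumption.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpListener, TcpStream};

/// Paths the PR moves to the health port.
const HEALTH_PATHS: [&str; 3] = ["/health", "/healthz", "/readyz"];

/// Map a request path to a status and body (hypothetical helper).
fn health_response(path: &str) -> (&'static str, &'static str) {
    if HEALTH_PATHS.contains(&path) {
        ("200 OK", "ok")
    } else {
        ("404 Not Found", "not found")
    }
}

/// Answer one plaintext HTTP request on an accepted connection.
fn handle_health_conn(mut stream: TcpStream) -> std::io::Result<()> {
    let mut reader = BufReader::new(stream.try_clone()?);

    // The request line looks like "GET /healthz HTTP/1.1".
    let mut request_line = String::new();
    reader.read_line(&mut request_line)?;
    let path = request_line.split_whitespace().nth(1).unwrap_or("/");

    // Drain the remaining headers so the client sees a clean close.
    let mut header = String::new();
    loop {
        header.clear();
        if reader.read_line(&mut header)? == 0 || header == "\r\n" || header == "\n" {
            break;
        }
    }

    let (status, body) = health_response(path);
    write!(
        stream,
        "HTTP/1.1 {}\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        status,
        body.len(),
        body
    )
}

/// Serve health requests sequentially on an already-bound listener.
/// No TLS is involved, so a probe needs no client certificate.
fn serve_health(listener: TcpListener) -> std::io::Result<()> {
    for stream in listener.incoming() {
        // A misbehaving probe should not take the listener down.
        let _ = stream.and_then(handle_health_conn);
    }
    Ok(())
}
```

A caller would bind a `TcpListener` on the health port (8081 by default, per this PR) and run `serve_health` on its own thread, separate from the mTLS listener.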

Related Issue

Fixes #897

Changes

  • Add health_bind_address to Config with --health-port CLI arg (default 8081, env OPENSHELL_HEALTH_PORT)
  • Spawn standalone axum::serve for health_router on the health port (plain HTTP, no TLS)
  • Remove health routes from the main multiplexed HTTP router
  • Add port-collision validation (--port vs --health-port)
  • Update Helm statefulset: add health container port, switch all probes from tcpSocket to httpGet on health port
  • Fix cluster-healthcheck.sh to open/close TCP without sending data, avoiding InvalidContentType TLS errors
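The port resolution and collision check listed above could look roughly like the following sketch. The function names are hypothetical, and the precedence (flag, then environment, then default) is an assumption based on common CLI conventions rather than something the PR states; only the default of 8081 and the `OPENSHELL_HEALTH_PORT` variable name come from the description.

```rust
/// Default health port named in the PR description.
const DEFAULT_HEALTH_PORT: u16 = 8081;

/// Resolve the health port. Assumed precedence: explicit `--health-port`
/// value, then the OPENSHELL_HEALTH_PORT value, then the default.
fn resolve_health_port(flag: Option<u16>, env_value: Option<&str>) -> Result<u16, String> {
    if let Some(port) = flag {
        return Ok(port);
    }
    match env_value {
        Some(raw) => raw
            .trim()
            .parse::<u16>()
            .map_err(|e| format!("invalid OPENSHELL_HEALTH_PORT {:?}: {}", raw, e)),
        None => Ok(DEFAULT_HEALTH_PORT),
    }
}

/// Reject a configuration where both listeners would bind the same port.
fn validate_ports(port: u16, health_port: u16) -> Result<(), String> {
    if port == health_port {
        Err(format!(
            "--port and --health-port must differ (both are {})",
            port
        ))
    } else {
        Ok(())
    }
}
```

Running the check at startup, before either listener binds, turns a confusing "address already in use" failure into a clear configuration error.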

Testing

  • cargo check -p openshell-core -p openshell-server passes
  • cargo test -p openshell-core -p openshell-server --lib passes (223 tests)
  • Verified health server listens on 8081 in running cluster
  • Verified Kubernetes probes correctly target health port
  • Identified and fixed cluster-healthcheck.sh as source of remaining TLS errors
  • E2E tests (mise run e2e)
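The `cluster-healthcheck.sh` fix relies on opening and closing a TCP connection without writing anything. The script itself is shell; the sketch below expresses the same idea in Rust with the standard library, and the name `tcp_port_open` is hypothetical. Whether a TLS server logs anything for an early close depends on its implementation, but no plaintext bytes reach its record parser.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Return true if a TCP connection to `addr` can be established within
/// `timeout`. No bytes are written, so a TLS listener on the other end
/// never receives plaintext it would reject as a malformed record.
fn tcp_port_open(addr: SocketAddr, timeout: Duration) -> bool {
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(stream) => {
            // Dropping the stream closes the socket without sending data.
            drop(stream);
            true
        }
        Err(_) => false,
    }
}
```

This only proves that something is listening on the port; it says nothing about readiness, which is why the Kubernetes probes in this PR use `httpGet` against the health port instead.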

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Move /health, /healthz, and /readyz to a dedicated plaintext HTTP port
(default 8081) so Kubernetes probes work without mTLS client certificates.

- Add health_bind_address to Config with --health-port CLI arg
- Spawn standalone axum::serve for health_router on the health port
- Remove health routes from the main multiplexed HTTP router
- Update Helm statefulset probes from tcpSocket to httpGet on health port
- Fix cluster-healthcheck.sh to open/close TCP without sending data,
  avoiding InvalidContentType TLS errors in the gateway log
@sjenning requested a review from a team as a code owner April 21, 2026 14:53
@copy-pr-bot

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@TaylorMutch self-assigned this Apr 21, 2026
Collaborator

@TaylorMutch left a comment

Couple of optional nits

Comment thread crates/openshell-core/src/config.rs
Comment thread crates/openshell-server/src/cli.rs
@TaylorMutch added the test:e2e (Requires end-to-end coverage) label Apr 21, 2026
@TaylorMutch
Collaborator

recheck

@TaylorMutch
Collaborator

Tested locally and confirmed this all works - Thanks!

@TaylorMutch merged commit bd11395 into NVIDIA:main Apr 21, 2026
11 of 13 checks passed
Labels

test:e2e Requires end-to-end coverage

Development

Successfully merging this pull request may close these issues.

bug: gateway logs InvalidContentType TLS errors every 5s from kubelet probes

3 participants