fix(k8s-driver): use dedicated kube client without read_timeout for watches #907

Merged
TaylorMutch merged 1 commit into NVIDIA:main from sjenning:fix/k8s-driver-watch-noise
Apr 21, 2026

@sjenning sjenning commented Apr 21, 2026

Summary

The shared Kubernetes client's 30-second read_timeout was terminating long-lived watch streams during idle periods, forcing a reconnect every 30 seconds. This PR adds a dedicated watch_client with read_timeout: None for watch operations and keeps the timeout-protected client for CRUD operations.

This avoids warnings like the following in the openshell-server log:

2026-04-21T14:54:28.803303Z  WARN openshell_server::compute: Compute driver watch stream errored error=status: Internal, message: "watch stream failed: Error reading events stream: ServiceError: error reading a body from connection", details: [], metadata: MetadataMap { headers: {} }
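
For reviewers who want to see the failure mode in isolation, here is a rough analogy at the plain socket level. It is not the driver's code and does not use the kube client; the function name `survives_idle_period` and the durations are made up for illustration. A peer stays silent for a while and then sends one byte, standing in for a watch stream that has no events for longer than the read timeout.

```rust
use std::io::{ErrorKind, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Reads one byte from a peer that stays silent for `idle` before sending.
/// Returns Ok(true) if the byte arrived, Ok(false) if the read timed out first.
/// Hypothetical helper for illustration only.
fn survives_idle_period(read_timeout: Option<Duration>, idle: Duration) -> std::io::Result<bool> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Peer: accept the connection, stay idle, then send a single "event" byte.
    let peer = thread::spawn(move || -> std::io::Result<()> {
        let (mut conn, _) = listener.accept()?;
        thread::sleep(idle);
        conn.write_all(b"e")?;
        Ok(())
    });

    let mut stream = TcpStream::connect(addr)?;
    // `None` means a read blocks until data arrives, however long that takes.
    stream.set_read_timeout(read_timeout)?;

    let mut buf = [0u8; 1];
    let outcome = match stream.read(&mut buf) {
        Ok(n) => Ok(n == 1),
        // A timed-out read surfaces as WouldBlock on Unix and TimedOut on Windows.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Ok(false),
        Err(e) => Err(e),
    };

    // Wait for the peer thread; its own result does not matter here.
    let _ = peer.join();
    outcome
}
```

With a timeout shorter than the idle period the read gives up before the byte arrives, while `None` waits it out. The real client applies its timeout at a higher layer, so this only illustrates the general behaviour described in the summary.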

Changes

  • Add a watch_client field to KubernetesComputeDriver with read_timeout: None
  • Add watch_api() helper that uses the watch client
  • Update watch_sandboxes to use the watch client for both sandbox and event watchers
  • CRUD operations (create, get, list, delete) continue using the original timeout-protected client
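
A minimal sketch of the shape of the changes listed above, assuming stand-in types rather than the real kube client. `ClientConfig`, `Client`, `new`, `api()` and the `connect_timeout` field are hypothetical names for illustration; only `KubernetesComputeDriver`, `watch_client`, `watch_api()` and `read_timeout: None` come from this PR's description, and the actual diff may be structured differently.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a kube client configuration (not the real type).
#[derive(Clone, Debug, PartialEq)]
struct ClientConfig {
    connect_timeout: Option<Duration>,
    read_timeout: Option<Duration>,
}

/// Hypothetical stand-in for a constructed client.
#[derive(Clone, Debug)]
struct Client {
    config: ClientConfig,
}

struct KubernetesComputeDriver {
    /// Timeout-protected client used for create, get, list and delete.
    client: Client,
    /// Dedicated client for long-lived watch streams.
    watch_client: Client,
}

impl KubernetesComputeDriver {
    fn new(base: ClientConfig) -> Self {
        // Copy the shared settings, but drop the read timeout for watches only.
        let watch_config = ClientConfig {
            read_timeout: None,
            ..base.clone()
        };
        Self {
            client: Client { config: base },
            watch_client: Client { config: watch_config },
        }
    }

    /// Client for short request/response calls (assumed helper name).
    fn api(&self) -> &Client {
        &self.client
    }

    /// Client for watch streams, as named in this PR.
    fn watch_api(&self) -> &Client {
        &self.watch_client
    }
}
```

Keeping two clients rather than removing the timeout globally means a hung CRUD request still fails after the configured timeout, while an idle watch is allowed to stay open.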

Testing

  • mise run pre-commit passes
  • cargo check -p openshell-driver-kubernetes compiles clean
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@sjenning sjenning requested a review from a team as a code owner April 21, 2026 18:20

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…atches

The 30s read_timeout on the shared kube client was killing the
long-lived watch streams during idle periods, causing a reconnect
cycle every 30 seconds. Use a separate client with no read_timeout
for watch_sandboxes so the streams stay open indefinitely.
@sjenning sjenning force-pushed the fix/k8s-driver-watch-noise branch from 9cf141e to c819cf9 Compare April 21, 2026 18:22
@TaylorMutch TaylorMutch self-assigned this Apr 21, 2026
@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label Apr 21, 2026
@TaylorMutch TaylorMutch merged commit 42c3cf6 into NVIDIA:main Apr 21, 2026
10 of 12 checks passed
