feat(server,driver-vm,e2e): gateway-owned readiness + VM compute driver e2e#901
Open
feat(server,driver-vm,e2e): gateway-owned readiness + VM compute driver e2e#901
Conversation
…on relay
Introduce a persistent supervisor-to-gateway session (ConnectSupervisor
bidirectional gRPC RPC) and migrate /connect/ssh and ExecSandbox onto
relay channels coordinated through it.
Architecture:
- gRPC control plane: carries session lifecycle (hello, heartbeat) and
relay lifecycle (RelayOpen, RelayOpenResult, RelayClose)
- HTTP data plane: for each relay, the supervisor opens a reverse HTTP
CONNECT to /relay/{channel_id} on the gateway; the gateway bridges
the client stream with the supervisor stream
- The supervisor is a dumb byte bridge with no SSH/NSSH1 awareness;
the gateway sends the NSSH1 preface through the relay
Key changes:
- Add ConnectSupervisor RPC and session/relay proto messages
- Add gateway session registry (SupervisorSessionRegistry) with
pending-relay map for channel correlation
- Add /relay/{channel_id} HTTP CONNECT endpoint
- Rewire /connect/ssh: session lookup + RelayOpen instead of direct
TCP dial to sandbox:2222
- Rewire ExecSandbox: relay-based proxy instead of direct sandbox dial
- Add supervisor session client with reconnect and relay bridge
- Remove ResolveSandboxEndpoint from proto, gateway, and K8s driver
Closes OS-86
When a sandbox first reports Ready, the supervisor session may not have completed its gRPC handshake yet. Instead of failing immediately with 502 / "supervisor session not connected", the relay open now retries with exponential backoff (100ms → 2s) for up to 15 seconds. This fixes the race between K8s marking the pod Ready and the supervisor establishing its ConnectSupervisor session.
Three related changes:
1. Fold the session-wait into `open_relay` itself via a new `wait_for_session`
helper with exponential backoff (100ms → 2s). Callers pass an explicit
`session_wait_timeout`:
- SSH connect uses 30s — it typically runs right after `sandbox create`,
so the timeout has to cover a cold supervisor's TLS + gRPC handshake.
- ExecSandbox uses 15s — during normal operation it only needs to cover
a transient supervisor reconnect window.
This covers both the startup race (pod Ready before the supervisor's
ConnectSupervisor stream is up) and mid-lifetime reconnects after a
network blip or gateway/supervisor restart — both look identical to the
caller.
2. Fix a supersede cleanup race. `LiveSession` now tracks a `session_id`,
and `remove_if_current(sandbox_id, session_id)` only evicts when the
registered entry still matches. Previously an old session's cleanup
could run after a reconnect had already registered the new session,
unconditionally removing the live registration.
3. Wire up `spawn_relay_reaper` alongside the existing SSH session reaper
so expired pending relay entries (supervisor acknowledged RelayOpen but
never opened the reverse CONNECT) are swept every 30s instead of
leaking until someone tries to claim them.
Adds 12 unit tests covering: open_relay happy path, timeout, mid-wait
session appearance, closed-receiver failure, supersede routing; claim_relay
unknown/expired/receiver-dropped/round-trip; and the remove_if_current
cleanup-race regression.
Replace the supervisor's reverse HTTP CONNECT data plane with a new
`RelayStream` gRPC RPC. Each relay now rides the supervisor's existing
`ConnectSupervisor` TCP+TLS+HTTP/2 connection as a new HTTP/2 stream,
multiplexed natively. Removes one TLS handshake per SSH/exec session.
- proto: add `RelayStream(stream RelayChunk) returns (stream RelayChunk)`;
the first chunk from the supervisor carries `channel_id` and no data,
matching the existing RelayOpen channel_id. Subsequent chunks are
bytes-only — leaving channel_id off data frames avoids a ~36 B
per-frame tax that would hurt interactive SSH.
- server: add `handle_relay_stream` alongside `handle_connect_supervisor`.
It reads the first RelayChunk for channel_id, claims the pending relay
(same `SupervisorSessionRegistry::claim_relay` path as before, returning
a `DuplexStream` half), then bridges that half ↔ the gRPC stream via
two tasks (16 KiB chunks). Delete `relay.rs` and its `/relay/{channel_id}`
HTTP endpoint.
- sandbox: on `RelayOpen`, open a `RelayStream` RPC on the existing
`Channel`, send `RelayChunk { channel_id, data: [] }` as the first frame,
then bridge the local SSH socket. Drop `open_reverse_connect`,
`send_connect_request`, `connect_tls`, and the `hyper`, `hyper-util`,
`http`, `http-body-util` deps that existed solely for the reverse CONNECT.
- tests: add `RelayStreamStream` type alias and `relay_stream` stub to the
seven `OpenShell` mock impls in server + CLI integration tests.
The registry shape (pending_relays, claim_relay, RelayOpen control message,
DuplexStream bridging) is unchanged, so the existing session-wait / supersede
/ reaper hardening on feat/supervisor-session-relay carries over intact.
… plane Default h2 initial windows are 64 KiB per stream and 64 KiB per connection. That throttles a single RelayStream SSH tunnel to ~500 Mbps on LAN, roughly 35% below the raw HTTP CONNECT baseline measured on `nemoclaw`. Bump both server (hyper-util auto::Builder via multiplex.rs) and client (tonic Endpoint in openshell-sandbox/grpc_client.rs) windows to 2 MiB / 4 MiB. This is the window size at which bulk throughput on a 50 MiB transfer matches the reverse HTTP CONNECT path. The numbers apply only to the RelayStream data plane in this branch; ConnectSupervisor and all other RPCs benefit too but are low-rate.
…, drop NSSH1
The embedded SSH daemon in openshell-sandbox no longer listens on a TCP
port. Instead it binds a root-owned Unix socket (default
/run/openshell/ssh.sock, 0700 parent dir, 0600 socket). The supervisor's
relay bridge connects to that socket instead of 127.0.0.1:2222.
With the socket gated by filesystem permissions, the NSSH1 HMAC preface
is redundant and has been removed:
- openshell-sandbox: drop `verify_preface`, `hmac_sha256`, the nonce
cache and reaper, and the preface read/write on every SSH accept.
`run_ssh_server` takes a `PathBuf` and uses `UnixListener`.
- openshell-server/ssh_tunnel: remove the NSSH1 write + response read
before bridging the client's upgraded CONNECT stream; the relay is
now bridged immediately.
- openshell-server/grpc/sandbox: same cleanup in the exec-path relay
proxy. `stream_exec_over_relay` and `start_single_use_ssh_proxy_over_relay`
stop taking a `handshake_secret`.
- openshell-server lib: the K8s driver is now configured with the
socket path ("/run/openshell/ssh.sock") instead of "0.0.0.0:2222".
- Parent directory of the socket is created with 0700 root:root by the
supervisor at startup to keep the sandbox entrypoint user out.
`ssh_handshake_secret` is still accepted on the CLI / env for backwards
compatibility but is no longer used for SSH.
Adds `sandbox_ssh_socket_path` to `Config` (default `/run/openshell/ssh.sock`). The K8s driver is now wired with the configured value instead of a hard-coded path. K8s and VM drivers already isolate the socket via per-pod / per-VM filesystems, so the default is safe there. This makes it easy to override in local dev when multiple supervisors share a filesystem, matching the prior `OPENSHELL_SSH_LISTEN_ADDR` knob on the supervisor side.
Adds tests/supervisor_relay_integration.rs covering the RelayStream wire contract, handshake frame, bridging, and claim timing. Five cases: happy-path echo, gateway drop, supervisor drop, no-session timeout, and concurrent multiplexed relays on one session. Narrows handle_relay_stream to take &SupervisorSessionRegistry directly so the test can exercise the real handler without standing up a full ServerState. Adds register_for_test for the same reason.
…ents Emits NetworkActivity events for session open/close/fail and relay open/close/fail from the sandbox side. Keeps plain tracing for internal plumbing (SSH socket connect, gateway stream close observation). Event shapes are extracted into pure builder fns so unit tests can assert activity/severity/status without wiring up a tracing subscriber. Gateway endpoint is parsed into host + port for dst_endpoint.
Adds ServerAliveInterval=15 and ServerAliveCountMax=3 to both the rendered ssh-config block and the direct ssh invocation used by `openshell sandbox connect`. Without this, a client-side SSH session hangs indefinitely when the gateway or supervisor dies mid-session: the relay transport's TCP connection can't signal EOF to the client because the peer process is gone, not cleanly closing. Detection now takes ~45s instead of the TCP keepalive default of 2 hours. Verified on a live cluster by deleting the gateway pod and the sandbox pod mid-session — SSH exits with "Broken pipe" after one missed ServerAlive reply.
The RPC was used by the direct gateway→sandbox SSH/exec path, which is
gone — connect/ssh and ExecSandbox both ride the supervisor session
relay now. Removes the RPC, SandboxEndpoint/ResolveSandboxEndpoint*
messages, and the now-dead ssh_port / sandbox_ssh_port config fields
across openshell-core, openshell-server, openshell-driver-kubernetes,
and openshell-driver-vm.
The k8s driver's standalone binary also stops synthesizing a TCP
listen address ("0.0.0.0:<port>") and reads the Unix socket path
directly from OPENSHELL_SANDBOX_SSH_SOCKET_PATH.
…rename ssh-listen-addr → ssh-socket-path Renames the sandbox binary's `--ssh-listen-addr` / `OPENSHELL_SSH_LISTEN_ADDR` / `ssh_listen_addr` to `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH` / `ssh_socket_path` so the flag name matches its sole accepted form (a Unix socket filesystem path) after the supervisor-initiated relay migration. Migrates the VM compute driver to the same supervisor-initiated model used by the K8s driver: the in-guest sandbox now binds `/run/openshell/ssh.sock` and opens its own outbound `ConnectSupervisor` session to the gateway, so the host→guest SSH port-forward is no longer needed. Drops `--vm-port` plumbing, the `ssh_port` allocation path, the `port_is_ready` TCP probe, and the now- unused `GUEST_SSH_PORT` import from `driver.rs`. Readiness falls back to the existing console-log marker from `guest_ssh_ready`. Remaining `ssh_port` / `GUEST_SSH_PORT` residue in `openshell-driver-vm/src/runtime.rs` (gvproxy port-mapping plan) is dead but left for OS-102, which already covers NSSH1/handshake plumbing removal across crates.
…p historical prose Updates `sandbox-connect.md`, `gateway.md`, `sandbox.md`, `gateway-security.md`, and `system-architecture.md` to describe the current supervisor-initiated model forward-facing: two-plane `ConnectSupervisor` + `RelayStream` design, the registry's `open_relay` / `claim_relay` / reaper behaviour, Unix-socket sshd access control, and the sandbox-side OCSF event surface. Strips historical framing that describes what was removed — the "Earlier designs..." paragraph, the "Historical: NSSH1 Handshake (removed)" subsection, retained-for-compat config/env table rows, and scattered "no longer X" prose — in favour of clean current-state descriptions. Syncs env- var and flag names to the renamed `--ssh-socket-path` / `OPENSHELL_SSH_SOCKET_PATH`.
Updates user-facing docs to match the connect/exec transport change: - `docs/security/best-practices.mdx` — SSH tunnel section now describes traffic riding the sandbox's mTLS session (transport auth) plus a short-lived session token scoped to the sandbox (authorization), with the sandbox's sshd bound to a local Unix socket rather than a TCP port. Removes the stale mention of the NSSH1 HMAC handshake. - `docs/observability/logging.mdx` — example OCSF shorthand lines for SSH:LISTEN / SSH:OPEN updated to reflect the current emit shape (no peer endpoint on the Unix-socket listener, no NSSH1 auth tag).
Adds two `ResourceExhausted`-returning guards on `open_relay` to bound the `pending_relays` map against runaway or abusive callers: - `MAX_PENDING_RELAYS = 256` — upper bound across all sandboxes. Caps the memory a caller can pin by calling `open_relay` in a loop while no supervisor ever claims (or the supervisor is hung). - `MAX_PENDING_RELAYS_PER_SANDBOX = 32` — per-sandbox ceiling so one noisy tenant can't consume the entire global budget. Sits above the existing SSH-tunnel per-sandbox cap (20) so tunnel-specific limits still fire first for that caller. Both checks and the `pending_relays` insert happen under a single lock hold so concurrent callers can't each observe "under the cap" and both insert past it. Adds a `sandbox_id` field on `PendingRelay` so the per-sandbox count is a single filter over the map without extra indexes. Tests: - Two unit tests in `supervisor_session.rs` — assert the global cap and the per-sandbox cap both return `ResourceExhausted` with the right message, and a cap-hit on one sandbox doesn't leak onto others. - One integration test in `supervisor_relay_integration.rs` — bursts 64 concurrent `open_relay` calls at a single sandbox and asserts exactly 32 succeed, exactly 32 are rejected with the per-sandbox message, and a different sandbox still accepts new relays. Reaper behaviour is unchanged; the cap makes the map bounded, so the existing `HashMap::retain` pass stays cheap under any load.
…on-grpc-data # Conflicts: # architecture/gateway.md # crates/openshell-sandbox/src/main.rs
…on-grpc-data # Conflicts: # crates/openshell-cli/tests/sandbox_name_fallback_integration.rs
The gateway now owns the Ready transition for sandboxes. When a ConnectSupervisor RPC registers a session in SupervisorSessionRegistry, ComputeRuntime promotes the sandbox record from Provisioning to Ready with reason=SupervisorConnected. When the session ends, the sandbox is demoted back to Provisioning with reason=SupervisorDisconnected. This replaces compute-driver-specific liveness probes (log grep, TCP poll) with the authoritative signal that the relay plane for SSH and exec is live end-to-end. Drivers that still want to report Error conditions (e.g. the VM driver on a dead launcher process) continue to do so; only the Ready transition moves. Wiring is observer-based: SupervisorSessionRegistry holds an optional trait object installed by ComputeRuntime::install_supervisor_observer during server startup. Session register/remove_if_current invoke on_session_connected/on_session_disconnected off the internal lock; the observer spawns async tasks to update the persisted sandbox record and notify the watch bus. Backfill path in apply_sandbox_update_locked handles the register-before-store race: when a driver snapshot arrives and the registry already has a live session for that sandbox, the phase is promoted on the spot. Adds 8 unit tests covering promote/demote transitions, terminal-state no-ops, idempotent re-register, and the backfill race. Adds 4 registry-level tests covering observer fire-once semantics and supersede-race guards.
The VM driver no longer owns the Ready transition — the gateway-side SupervisorSessionObserver now promotes sandboxes to Ready when their supervisor session connects. Remove guest_ssh_ready() (a brittle grep over the serial console) and the ready_condition() helper. monitor_sandbox still watches the launcher child process and emits Error conditions on ProcessExited / ProcessPollFailed. Also always start gvproxy, not just when port_map is non-empty. With the supervisor-initiated relay migration in #867, the SSH port forward was dropped; that left port_map empty in the default path, which in turn skipped gvproxy startup, which left the guest with no eth0 and no route to the host gateway. The guest supervisor's outbound ConnectSupervisor stream needs gvproxy to reach host.containers.internal (rewritten to 192.168.127.1 inside the guest), so gvproxy is structurally required for any sandbox that talks to the gateway. Inline the gvproxy setup into an unconditional block that returns (guard, api_sock, forwarded_port_map), dropping the mutable plumbing the prior conditional form needed. Remove the now-dead VmContext::set_port_map wrapper; mark its libkrun FFI binding #[allow(dead_code)] so a future reintroduction doesn't need to touch the symbol table.
Rewrite e2e/rust/e2e-vm.sh for the split-binary flow (openshell-gateway
+ openshell-driver-vm) now that the former openshell-vm K8s-in-a-VM
binary is gone. The new flow:
1. Stage the embedded VM runtime (libkrun + gvproxy + base rootfs)
via mise run vm:setup and mise run vm:rootfs -- --base, both
idempotent and run only when artifacts are missing.
2. Build openshell-gateway, openshell-driver-vm, and the openshell
CLI from the current workspace with cargo.
3. On macOS, codesign the driver with the Hypervisor.framework
entitlement so libkrun can start the microVM.
4. Start the gateway with --drivers vm --disable-tls
--disable-gateway-auth --db-url sqlite::memory:, pinning
--driver-dir target/debug so the gateway picks up the freshly
built driver rather than ~/.local/libexec/openshell from a
prior install-vm.sh run.
5. Wait for 'Server listening', run the cluster-agnostic Rust smoke
test against OPENSHELL_GATEWAY_ENDPOINT=http://127.0.0.1:<port>,
then SIGTERM the gateway.
State paths root under /tmp rather than target/ because the VM
driver's compute-driver.sock lives under --vm-driver-state-dir; with
AF_UNIX SUN_LEN = 104 bytes on macOS (108 on Linux), worktree paths
under target/ routinely blow the limit.
On failure, the trap preserves the per-run state dir plus dumps the
gateway log and every sandbox's rootfs-console.log inline so CI
artifacts capture post-mortem data.
Drop the former --vm-port / --vm-name reuse path entirely — the new
gateway is cheap to start (a few seconds, no k3s bootstrap) and that
reuse flow mapped to openshell-vm's StatefulSet rollout, which no
longer exists. Drop the build:docker:gateway and vm:build task
dependencies from tasks/test.toml's e2e:vm for the same reason.
With the SSH port forward removed in #867 and no other host→guest port mappings in play, everything that configured gvproxy's port-forwarder is dead weight. gvproxy stays because the VM still needs its virtual NIC, DHCP server, and default router for guest egress, and because the sandbox supervisor's per-sandbox netns (veth + iptables, see openshell-sandbox/src/sandbox/linux/netns.rs) needs a real kernel network stack inside the guest to branch off of — libkrun's built-in TSI socket impersonation would not satisfy those primitives. What we stop doing: * Dropping the `-listen` API socket. No one calls `/services/forwarder/expose` on it any more. * Passing `-ssh-port -1`. gvproxy's default 2222 SSH forward binds a host-side TCP listener that would race concurrent sandboxes and surface a misleading 'sshd is reachable' endpoint. `-1` is gvproxy's documented switch for 'no SSH forward'; see getForwardsMap in containers/gvisor-tap-vsock cmd/gvproxy/main.go. * Removing VmLaunchConfig::port_map and the CLI --vm-port flag. * Removing krun_set_port_map from the libkrun FFI bindings. * Removing helpers that only made sense when we had a port map to manage: plan_gvproxy_ports, parse_port_mapping, expose_port_map, gvproxy_expose, pick_gvproxy_ssh_port, kill_stale_gvproxy_by_port, kill_stale_gvproxy_by_port_map, kill_gvproxy_pid, is_process_named, and the GUEST_SSH_PORT constant. * Removing the four port-mapping unit tests. Verified: after `sandbox create -- echo hi`, `lsof` shows gvproxy opens zero TCP listeners; only its qemu/vfkit unixgram data socket remains. E2E smoke still passes in ~10s.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makes the VM compute driver end-to-end path work on top of the supervisor-initiated relay in #867, and moves the authoritative "sandbox is Ready" transition from each compute driver onto the gateway. The smoke test against
openshell-gateway --drivers vm(mise run e2e:vm) goes from hanging at 180s to passing in ~10s.Related Issue
Stacked on top of #867. No issue link.
Changes
feat(server): promote sandbox phase on supervisor session connectSupervisorSessionObservertrait.SupervisorSessionRegistryinvokes the observer onregister/remove_if_currentoutside the internal mutex.ComputeRuntime::install_supervisor_observerwires aComputeSessionObserverbridge; the runtime holds aWeak<SupervisorSessionRegistry>to break theArccycle between registry and observer.mark_sandbox_session_connected/mark_sandbox_session_disconnectedflip phase and rewrite theReadycondition withreason=SupervisorConnected/SupervisorDisconnected. Terminal states (Deleting,Error) are preserved.apply_sandbox_update_lockedhandles the register-before-store race: if a driver snapshot arrives and the registry already holds a live session for that sandbox, phase is promoted on the spot.refactor(driver-vm): drop log-grep readiness; always run gvproxyguest_ssh_ready()andready_condition(). The driver no longer ownsReady;monitor_sandboxonly surfacesErrorfor launcher-process failures.runtime.rsnow starts gvproxy unconditionally. With the SSH port forward removed in feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867,port_mapwas empty by default, which skipped gvproxy startup entirely — leaving the guest with noeth0and no route to the host gateway. The guest supervisor'sConnectSupervisorstream needs gvproxy to reachhost.containers.internal(rewritten to192.168.127.1inside the guest).VmContext::set_port_map; mark the libkrun FFI binding#[allow(dead_code)].e2e(vm): run smoke against openshell-gateway with the VM compute drivere2e/rust/e2e-vm.shfor the split-binary flow (formeropenshell-vmK8s-in-a-VM binary is gone).--driver-dir target/debugso the gateway picks up the freshly cargo-built driver rather than a stale~/.local/libexec/openshell/openshell-driver-vmfrom a priorinstall-vm.shrun./tmp(macOSAF_UNIXSUN_LENis 104 bytes; worktree paths routinely blow it).rootfs-console.loginline for post-mortem.build:docker:gatewayandvm:builddependencies fromtasks/test.toml'se2e:vm.Testing
mise run pre-commitpasses (lint + format + license headers clean; clippy warnings unchanged from baseline)openshell-serverlib: 255 pass (+8 compute promotion tests, +4 registry observer tests)openshell-driver-vmlib: 17 passopenshell-serverintegration (supervisor_relay_integration): 6 passmise run e2e:vmpasses in ~10s, stable across back-to-back runsChecklist
Notes for the reviewer
feat/supervisor-session-grpc-data, notmain. Merge order: land feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867 first, then rebase this onto main.driver-vm/runtime.rscould arguably belong in feat(server,sandbox): supervisor-initiated SSH connect and exec over gRPC-multiplexed relay #867 itself (it's a latent bug in that PR — VMs have no network after the SSH port forward was dropped). Happy to split that hunk off into a separate commit againstfeat/supervisor-session-grpc-datadirectly if that's the preferred landing path.Ready=Truefrom kubelet still wins (the gateway override only applies when phase isProvisioning/Unknown).