Skip to content

fix(client/stdio): allow FIFO cleanup of multiple transports on asyncio (#577)#2484

Open
sakenuGOD wants to merge 7 commits intomodelcontextprotocol:mainfrom
sakenuGOD:fix/577-stdio-fifo-cleanup
Open

fix(client/stdio): allow FIFO cleanup of multiple transports on asyncio (#577)#2484
sakenuGOD wants to merge 7 commits intomodelcontextprotocol:mainfrom
sakenuGOD:fix/577-stdio-fifo-cleanup

Conversation

@sakenuGOD
Copy link
Copy Markdown

Fixes #577.

The bug

stdio_client wraps its reader / writer in anyio.create_task_group() inside the async context manager body (src/mcp/client/stdio.py L180 on main). A task group binds its cancel scope to whichever task entered the generator, and anyio enforces a strict LIFO stack on those scopes within a single task.

If a caller opens two transports from the same task and then closes them in the order they were opened — a pattern that shows up in multi-client orchestrators, DI containers and pytest fixtures — teardown crashes with:

RuntimeError: Attempted to exit a cancel scope that isn't the current task's current cancel scope

(The traceback in #577 matches this exactly. I reproduced it locally against a freshly cloned main with a minimal repro that spawns two python -c ... subprocesses and closes them FIFO.)

This has been biting users for a year — hits asyncio callers using AsyncExitStack, pytest fixtures, custom multi-client managers (see the comment thread on #577 for the list). The recommended workaround until now was "always clean up LIFO", which pushes anyio's cancel-scope discipline onto every consumer of the SDK and breaks naturally with many real code-organisation patterns.

The fix

On asyncio we don't need a task group to spawn the reader / writer — asyncio.create_task is enough. The tasks then live outside any caller-bound cancel scope, so each transport is self-contained and the cleanup order of two transports no longer matters.

  • Backend is detected via sniffio.current_async_library() (already a transitive dependency of anyio).
  • The documented shutdown sequence (close stdin → wait for process with PROCESS_TERMINATION_TIMEOUT → escalate to _terminate_process_tree → close the memory streams) is lifted into a shared _cleanup_process_and_streams helper so the two branches can't drift.
  • Reader / writer cancellation happens after the process-level cleanup so late I/O errors surfaced during teardown are expected and swallowed, matching the pre-existing anyio.ClosedResourceError handling in those functions.
  • On structured-concurrency backends (trio) the anyio task group is kept — trio forbids orphan tasks by design, and cross-task cleanup is fundamentally incompatible with that model. Trio callers still have to clean up LIFO. This is acknowledged in the inline comment on the backend branch.

Tests

Three new regression tests in tests/client/test_stdio.py, all asyncio-only (parametrize(\"anyio_backend\", [\"asyncio\"])):

Test What it pins
test_stdio_client_supports_fifo_cleanup_on_asyncio The exact scenario from #577 — two transports opened, closed FIFO. Verified to fail on the unmodified main branch with the original RuntimeError, so it is a real regression guard, not a tautology.
test_stdio_client_supports_lifo_cleanup_on_asyncio Sanity: historical LIFO path must still work unchanged after the fix.
test_stdio_client_shared_exit_stack_fifo_on_asyncio A single AsyncExitStack around two transports (ExitStack already runs callbacks LIFO, so this case worked before — but worth pinning so a future refactor of the asyncio branch can't silently break it).

pytest tests/client/test_stdio.py --timeout=3013 passed, 1 skipped (the Unix-only SIGINT test is the pre-existing skip; 10 other pre-existing tests + 3 new, none regressed).

Negative-path validation: stashed only src/mcp/client/stdio.py, reran the FIFO test — it fails with the original RuntimeError: Attempted to exit a cancel scope that isn't the current task's current cancel scope. Restored the fix, all three pass.

Scope

Kept strictly to stdio_client. sse.py, streamable_http.py, websocket.py use a similar task-group-in-generator pattern and may need the same treatment — they can be handled in follow-ups if this approach is acceptable.

independently on asyncio (modelcontextprotocol#577)

`stdio_client` wrapped its reader / writer in
`anyio.create_task_group()` inside the async context manager body. The
task group binds its cancel scope to whichever task entered the
generator, and anyio enforces a strict LIFO stack on those scopes
within a single task. If a caller opened two transports from the same
task and then closed them in the order they were opened (FIFO) — which
is what happens in multi-client orchestrators, dependency-injection
containers and pytest fixtures — teardown crashed with:

    RuntimeError: Attempted to exit a cancel scope that isn't the
    current task's current cancel scope

On the asyncio backend we can spawn the reader / writer with
`asyncio.create_task` instead. The background tasks then live outside
any caller-bound cancel scope, so each transport becomes
self-contained and the cleanup order of two transports no longer
matters. Teardown keeps the documented shutdown sequence (close stdin,
wait for the process with `PROCESS_TERMINATION_TIMEOUT`, escalate to
`_terminate_process_tree`, then close the memory streams) in a shared
`_cleanup_process_and_streams` helper so the two branches can't drift.

On structured-concurrency backends (trio) an anyio task group is still
required — trio intentionally forbids orphan tasks, and cross-task
cleanup is fundamentally incompatible with its model. Those callers
still have to clean up LIFO. The backend dispatch goes through
`sniffio.current_async_library()`.

Regression tests (all asyncio-only):

- `test_stdio_client_supports_fifo_cleanup_on_asyncio` — the exact
  scenario from modelcontextprotocol#577. Verified that it FAILS on the pre-fix code
  with the original RuntimeError, and passes on the fixed code.
- `test_stdio_client_supports_lifo_cleanup_on_asyncio` — historical
  LIFO path must still work unchanged.
- `test_stdio_client_shared_exit_stack_fifo_on_asyncio` — a single
  AsyncExitStack around two transports (already LIFO under the hood,
  but worth pinning so future refactors cannot silently break it).

All 13 stdio tests pass locally (10 pre-existing + 3 new, 1 Unix-only
sigint test skipped on Windows).

Fixes modelcontextprotocol#577
… background

reader/writer and close streams on crash

Self-review of the first commit caught a regression: the asyncio
branch spawned stdout_reader / stdin_writer via asyncio.create_task
and did `try: await task except BaseException: pass` during cleanup,
which silently swallowed any exception the reader or writer raised.
An anyio task group would have re-raised that through the `async
with` block — the asyncio path has to reproduce that contract to
keep the fix observably equivalent.

Two changes:

1. `add_done_callback(_on_background_task_done)` — if a background
   task crashes while the caller is still inside the `yield`, close
   the memory streams immediately so any in-flight user read wakes
   up with `ClosedResourceError` instead of hanging forever. Mirrors
   the scope-cancellation side effect of a task group. The callback
   logs the exception at debug so operators can trace crashes that
   we only surface on the way out.

2. On `__aexit__` we now collect task results and re-raise the first
   non-cancellation, non-closed-resource exception. CancelledError
   and anyio.ClosedResourceError are expected during the teardown we
   just forced in `_cleanup_process_and_streams`, so they are
   swallowed; anything else (a real crash) propagates.

New regression test:
`test_stdio_client_reader_crash_propagates_on_asyncio` — injects a
TextReceiveStream that raises on the first `__anext__`, asserts the
exception is visible at the async-with exit. Confirmed it FAILS on
the first-commit version ("Failed: DID NOT RAISE"), so it is real
regression coverage, not a tautology.

All 14 stdio tests pass (10 pre-existing + 4 new, 1 Unix-only skip).
Full `tests/client/` pass (203 passed, 1 pre-existing xfail).
@sakenuGOD
Copy link
Copy Markdown
Author

Self-review follow-up in 1d56c4a.

While double-checking the fix I caught a real regression in the first commit: the asyncio branch was spawning stdout_reader / stdin_writer via asyncio.create_task and doing try: await task except BaseException: pass during cleanup, which silently swallowed any exception the reader or writer raised. An anyio task group would have surfaced that through the async with block, so the fix was observably weaker than the code it replaced.

Fix (stacked on top of the original commit):

  1. add_done_callback(_on_background_task_done) — if a background task crashes while the caller is still inside the yield, close the memory streams immediately so any in-flight user read wakes up with ClosedResourceError instead of hanging forever. Mirrors the scope-cancellation side effect of a task group. Logs the exception at debug on the way through.

  2. On __aexit__ we now collect task results and re-raise the first non-cancellation, non-closed-resource exception. CancelledError and anyio.ClosedResourceError are expected during the teardown we just forced in _cleanup_process_and_streams, so they're swallowed; anything else propagates.

New regression test: test_stdio_client_reader_crash_propagates_on_asyncio injects a TextReceiveStream that raises on the first __anext__ and asserts the exception surfaces at the async-with exit. Confirmed it fails on the first-commit version with Failed: DID NOT RAISE, so it's real regression coverage, not a tautology.

Results: 14 stdio tests pass (10 pre-existing + 4 new, 1 Unix-only SIGINT skip). Full tests/client/ suite: 203 passed, 1 pre-existing xfail, no new failures.

…nd pyright

CI on modelcontextprotocol#2484 surfaced three categories of failures on every Python
version:

  1. ruff D212 — multi-line docstring summaries were split from the
     opening ``"""``. Collapsed the first line of each docstring
     touched by this PR to sit on the opening-quote line.
  2. ruff C901 — the asyncio-vs-task-group dispatch pushed
     ``stdio_client``'s cyclomatic complexity from below the
     project's max of 24 up to 29. Extracted the two branches into
     dedicated async context managers:

        _asyncio_background_tasks(...) — the asyncio path with
            add_done_callback to close streams on crash, and the
            ``__aexit__`` that re-raises non-cancellation exceptions.
        _anyio_task_group_background(...) — the trio / other-backend
            path that keeps the original task group.

     ``stdio_client`` now just picks one based on
     ``sniffio.current_async_library()`` and uses it as an
     ``async with`` alongside ``process``. The backend-dispatch
     behaviour is identical to the first commit — only the shape
     of the code changed.
  3. pyright reportArgumentType — the callable types for the helpers
     needed to be ``Callable[[], Coroutine[Any, Any, None]]`` because
     ``asyncio.create_task`` requires a Coroutine, not an Awaitable.
     Imported ``Coroutine`` and ``Any`` accordingly.

No behavioural change: all four regression tests still pass, the
pre-existing 10 stdio tests are unchanged, and the full
``tests/client/`` suite goes through clean (203 passed, 1 pre-existing
xfail, 1 Unix-only SIGINT skip).
…overage

Second CI round caught two more things the local run did not:

1. pyright on the CI runs in a stricter mode than the default, and
   flagged every new test:
     - `anyio_backend` parametrize fixture needs an explicit `str` type
     - `monkeypatch` needs `pytest.MonkeyPatch`
     - `*args, **kwargs` on the stub stream class need `Any`
     - `__aiter__` / `__anext__` need return types
   All annotated; pyright now reports 0 errors on both files.

2. The project's coverage floor is `fail_under = 100`. Extracting the
   trio branch into its own helper function left lines 184-187 (the
   body of `_anyio_task_group_background`) and the `else:` on line
   304 uncovered, because every existing test runs on asyncio.
   Added `test_stdio_client_supports_lifo_cleanup_on_trio` —
   parametrised with `anyio_backend=["trio"]`, it exercises the trio
   branch end-to-end with the LIFO cleanup pattern that is valid on
   trio. Removed the `# pragma: lax no cover` markers on the helper
   and the else branch since coverage is now genuine, not silenced.

15/16 stdio tests pass locally (15 real pass + 1 Unix-only SIGINT
skip). pyright: 0 errors. ruff: clean.
… branches

Second CI round got coverage up to 99.97% but the last uncovered
lines are defensive cleanup paths that normal teardown does not
exercise and that do not warrant a dedicated test each:

  - `_on_done` callback's "task cancelled" early-return (normal
    teardown exits the reader/writer cleanly through stream-close,
    not via explicit cancellation).
  - `task.cancel()` guard when the task is somehow still pending by
    the time we reach the finally block (same reason — tasks should
    be done already).
  - CancelledError and ClosedResourceError catches in the result
    collection loop (symmetric with the case above; those arise
    only if the task was still active when we began teardown).

Each line is marked with a terse ``# pragma: no cover`` pointing at
the reason, matching the pattern already used elsewhere in this
file (e.g. ``except Exception:  # pragma: no cover`` on line 143
inside ``_on_done``).
…says are reached

Third CI round: coverage reached 100.00% on the previous commit, but
the project also runs ``strict-no-cover`` which complains when a
line is marked ``# pragma: no cover`` yet actually gets hit by the
test suite (it prevents coverage rot from creeping in under pragma
cover).

It flagged two of my pragmas:
  src/mcp/client/stdio.py:161 — `except asyncio.CancelledError:`
  src/mcp/client/stdio.py:163 — `except anyio.ClosedResourceError:`

Both are reached on Linux during the normal teardown of the new
stdio tests, so the pragma was wrong. Removed them. (The Windows
runs did not hit these branches — which explains why Windows matrix
passed last round but Linux did not; strict-no-cover runs on the
Linux jobs.)

Kept the pragmas on lines 131, 155, 166 — those are the defensive
branches that still are not reached by any existing test on either
platform and still warrant the marker.

15 stdio tests pass locally; ruff + pyright clean.
…dependent exception catches

The two catches I touched in the previous commit
(``except asyncio.CancelledError`` and
``except anyio.ClosedResourceError`` in the teardown result-collection
loop) turn out to be timing-dependent:

  - On Windows the teardown of the background reader/writer finishes
    before the loop reaches ``await task``, so ``await task`` never
    raises CancelledError / ClosedResourceError and the branches are
    unreached — needs pragma.
  - On some Linux runs the reader/writer is still running when we
    cancel and await, so the branches *are* reached — strict-no-cover
    complained that a plain ``pragma: no cover`` was wrong there.

Switched both to ``pragma: lax no cover``, which the project's
coverage config already defines as the opt-out marker for exactly
this case (exists on other lines in the same file, e.g. 247).
It is excluded from coverage reporting but does not trip
strict-no-cover when the line happens to be hit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RuntimeError: Attempted to exit cancel scope in a different task when cleaning up multiple MCPClient instances out-of-order

1 participant