
fix(eval): include function-call events in invocation_events when skip_summarization is set #5417

Open
Koushik-Salammagari wants to merge 9 commits into google:main from Koushik-Salammagari:fix/trajectory-eval-skip-summarization

Conversation

@Koushik-Salammagari

Link to Issue or Description of Change

Fixes #5410

Description

EvaluationGenerator.convert_events_to_eval_invocations builds
invocation_events (the intermediate tool-call record used by
TrajectoryEvaluator) by collecting all qualifying events and then excluding
the final_event from the list.

The final event is identified via event.is_final_response(), but
is_final_response() returns True for any event with
skip_summarization=True — even events that contain function_call parts
(e.g. tools that use skip_summarization to surface their result directly
without an LLM summarization step). Those events were silently dropped from
invocation_events, causing get_all_tool_calls() to return [] for the
actual invocation. As a result, tool_trajectory_avg_score was always 0.0
even when the tool name and arguments matched the expected values exactly.

Root cause: is_final_response() conflates "final user-visible response"
with "should be excluded from tool trajectory". When skip_summarization=True
the function-call event is both the final response and an intermediate step
that must appear in the trajectory.

Fix: in the list comprehension that builds invocation_events, keep an
event even when it equals final_event if it contains function calls:

# before
if e is not final_event

# after
if e is not final_event or e.get_function_calls()
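
The effect of the one-line change can be sketched with a minimal stand-in for the ADK event type (the `Event` class below is illustrative, not the real ADK API; only `get_function_calls()` and the comprehension mirror the actual code):

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """Stand-in for the ADK event: may carry function-call parts."""
    function_calls: list = field(default_factory=list)
    skip_summarization: bool = False

    def get_function_calls(self):
        return self.function_calls

# A tool event with skip_summarization=True is simultaneously the final
# response and an intermediate trajectory step.
text_event = Event()
tool_event = Event(function_calls=["get_weather"], skip_summarization=True)

events = [text_event, tool_event]
final_event = tool_event  # identified via is_final_response() in the real code

# Before the fix: the function-call event is dropped along with the final event.
before = [e for e in events if e is not final_event]

# After the fix: the final event is kept when it carries function calls.
after = [e for e in events if e is not final_event or e.get_function_calls()]

assert tool_event not in before  # bug: tool call lost, score becomes 0.0
assert tool_event in after       # fix: tool call preserved in the trajectory
```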

Changes

  • src/google/adk/evaluation/evaluation_generator.py: one-line fix
  • tests/unittests/evaluation/test_evaluation_generator.py: regression test that verifies tool calls are preserved when skip_summarization=True
  • tests/unittests/evaluation/test_trajectory_evaluator.py: end-to-end tests for InvocationEvents intermediate_data format (exact match → 1.0, mismatch → 0.0)
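
The end-to-end expectation in those tests (exact match → 1.0, mismatch → 0.0) follows from how the average score is built up. A simplified sketch of that scoring, assuming per-invocation exact matching of (tool name, args) sequences (the real TrajectoryEvaluator lives in ADK; this function is illustrative only):

```python
def tool_trajectory_avg_score(actual, expected):
    """Illustrative: 1.0 per invocation whose tool-call sequence matches
    the expected one exactly, 0.0 otherwise; averaged over invocations."""
    scores = [1.0 if a == e else 0.0 for a, e in zip(actual, expected)]
    return sum(scores) / len(scores) if scores else 0.0

call = [("get_weather", {"city": "SF"})]
assert tool_trajectory_avg_score([call], [call]) == 1.0  # exact match
assert tool_trajectory_avg_score([[]], [call]) == 0.0    # dropped calls
```

This also shows why the bug always produced 0.0: with the tool-call events dropped, the actual trajectory was an empty list, which can never match a non-empty expected sequence.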

Testing Plan

pytest tests/unittests/evaluation/test_trajectory_evaluator.py \
       tests/unittests/evaluation/test_evaluation_generator.py -v
======================== 47 passed in 1.23s ============================

…in thread pool

When RunConfig.tool_thread_pool_config is enabled, _call_tool_in_thread_pool
used None as a sentinel to distinguish "FunctionTool ran in thread pool" from
"non-FunctionTool sync tool, needs async fallback". Because None is also a
valid return value from any FunctionTool whose underlying function has no
explicit return statement (implicit None), the sentinel check failed and
execution fell through to tool.run_async(), invoking the function a second
time silently.

Replace the None sentinel with a dedicated _SYNC_TOOL_RESULT_UNSET object so
that a legitimate None result from a FunctionTool is correctly returned on the
first execution, without triggering the async fallback path.

Fixes google#5284
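
The sentinel-object pattern behind this commit can be sketched as follows (the `_SYNC_TOOL_RESULT_UNSET` name comes from the commit; the surrounding helper functions are illustrative stand-ins for the real dispatch logic, with the async fallback stubbed out):

```python
# A dedicated sentinel: a unique object compared by identity, so a tool
# that legitimately returns None is never mistaken for "did not run".
_SYNC_TOOL_RESULT_UNSET = object()

def call_tool_in_thread_pool(tool_fn, is_function_tool):
    """Illustrative: run a sync FunctionTool, else signal fallback."""
    result = _SYNC_TOOL_RESULT_UNSET
    if is_function_tool:
        result = tool_fn()  # may legitimately return None
    return result

def run_tool(tool_fn, is_function_tool):
    result = call_tool_in_thread_pool(tool_fn, is_function_tool)
    if result is _SYNC_TOOL_RESULT_UNSET:
        # Non-FunctionTool path: fall back (stubbed; the real code
        # awaits tool.run_async() here).
        result = "ran via async fallback"
    return result

calls = []
def tool_returning_none():
    calls.append(1)  # count invocations to show there is no double call
    # implicit None return

assert run_tool(tool_returning_none, is_function_tool=True) is None
assert len(calls) == 1  # the old None sentinel would have hit the fallback
```

The key detail is the `is` comparison: identity, not equality, so no legitimate return value — `None` included — can collide with the sentinel.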
…ases

Per reviewer feedback: collapse the two near-identical None tests into a
single @pytest.mark.parametrize test, and add falsy-but-not-None cases
(0, '', {}, False) to prove the sentinel is identity-based and does not
mishandle any falsy return value from a FunctionTool.
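
The shape of that parametrized test might look like this (a self-contained sketch: `run_tool` is a minimal stand-in for the fixed dispatch logic, and the test name is hypothetical):

```python
import pytest

_SYNC_TOOL_RESULT_UNSET = object()  # identity-compared sentinel (name from the commit)

def run_tool(tool, is_function_tool):
    # Minimal stand-in for the fixed dispatch logic.
    result = tool() if is_function_tool else _SYNC_TOOL_RESULT_UNSET
    if result is _SYNC_TOOL_RESULT_UNSET:
        result = "async fallback"
    return result

@pytest.mark.parametrize("tool_return", [None, 0, "", {}, False])
def test_falsy_returns_survive(tool_return):
    # Identity-based sentinel: every falsy return value from a
    # FunctionTool must round-trip unchanged, invoked exactly once.
    calls = []
    def tool():
        calls.append(1)
        return tool_return
    assert run_tool(tool, is_function_tool=True) == tool_return
    assert len(calls) == 1  # no silent second invocation
```

Collapsing the near-identical cases into one parametrized test keeps the falsy-value coverage visible in a single place rather than spread across duplicated test bodies.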
…p_summarization is set

EvaluationGenerator.convert_events_to_eval_invocations builds invocation_events
by excluding the final_event from intermediate steps. However, is_final_response()
returns True for any event with skip_summarization=True, even when that event
contains function calls (e.g. tools using skip_summarization to bypass LLM
summarization). Such events were incorrectly excluded from invocation_events,
causing get_all_tool_calls() to return an empty list and
tool_trajectory_avg_score to always be 0.0 despite matching tool calls.

Fix: keep an event in invocation_events even if it is the final_event when
it contains function calls.

Fixes google#5410
@adk-bot adk-bot added the eval [Component] This issue is related to evaluation label Apr 20, 2026
@rohityan rohityan self-assigned this Apr 20, 2026
@rohityan
Collaborator

Hi @Koushik-Salammagari, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors by running autoformat.sh

@rohityan rohityan added the request clarification [Status] The maintainer needs clarification or more information from the author label Apr 20, 2026

Labels

eval [Component] This issue is related to evaluation
request clarification [Status] The maintainer needs clarification or more information from the author


Development

Successfully merging this pull request may close these issues.

tool_trajectory_avg_score returns 0.0 even when tool name and args match exactly

3 participants