fix(eval): include function-call events in invocation_events when skip_summarization is set#5417
Open
Koushik-Salammagari wants to merge 9 commits intogoogle:mainfrom
Conversation
…in thread pool When RunConfig.tool_thread_pool_config is enabled, _call_tool_in_thread_pool used None as a sentinel to distinguish "FunctionTool ran in thread pool" from "non-FunctionTool sync tool, needs async fallback". Because None is also a valid return value from any FunctionTool whose underlying function has no explicit return statement (implicit None), the sentinel check failed and execution fell through to tool.run_async(), invoking the function a second time silently. Replace the None sentinel with a dedicated _SYNC_TOOL_RESULT_UNSET object so that a legitimate None result from a FunctionTool is correctly returned on the first execution, without triggering the async fallback path. Fixes google#5284
…ases
Per reviewer feedback: collapse the two near-identical None tests into a
single @pytest.mark.parametrize test, and add falsy-but-not-None cases
(0, '', {}, False) to prove the sentinel is identity-based and does not
mishandle any falsy return value from a FunctionTool.
…p_summarization is set EvaluationGenerator.convert_events_to_eval_invocations builds invocation_events by excluding the final_event from intermediate steps. However, is_final_response() returns True for any event with skip_summarization=True, even when that event contains function calls (e.g. tools using skip_summarization to bypass LLM summarization). Such events were incorrectly excluded from invocation_events, causing get_all_tool_calls() to return an empty list and tool_trajectory_avg_score to always be 0.0 despite matching tool calls. Fix: keep an event in invocation_events even if it is the final_event when it contains function calls. Fixes google#5410
Collaborator
|
Hi @Koushik-Salammagari , Thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix formatting errors by running autoformat.sh |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Link to Issue or Description of Change
Fixes #5410
Description
EvaluationGenerator.convert_events_to_eval_invocationsbuildsinvocation_events(the intermediate tool-call record used byTrajectoryEvaluator) by collecting all qualifying events and then excludingthe
final_eventfrom the list.The final event is identified via
event.is_final_response(), butis_final_response()returnsTruefor any event withskip_summarization=True— even events that containfunction_callparts(e.g. tools that use
skip_summarizationto surface their result directlywithout an LLM summarization step). Those events were silently dropped from
invocation_events, causingget_all_tool_calls()to return[]for theactual invocation. The result:
tool_trajectory_avg_scorewas always 0.0even when the tool name and args matched the expected exactly.
Root cause:
is_final_response()conflates "final user-visible response"with "should be excluded from tool trajectory". When
skip_summarization=Truethe function-call event is both the final response and an intermediate step
that must appear in the trajectory.
Fix: in the list comprehension that builds
invocation_events, keep anevent even when it equals
final_eventif it contains function calls:Changes
src/google/adk/evaluation/evaluation_generator.py: one-line fixtests/unittests/evaluation/test_evaluation_generator.py: regression test that verifies tool calls are preserved whenskip_summarization=Truetests/unittests/evaluation/test_trajectory_evaluator.py: end-to-end tests forInvocationEventsintermediate_data format (exact match → 1.0, mismatch → 0.0)Testing Plan