Skip to content

feat(observability): add mothership tracing#4253

Open
Sg312 wants to merge 28 commits intostagingfrom
dev
Open

feat(observability): add mothership tracing#4253
Sg312 wants to merge 28 commits intostagingfrom
dev

Conversation

@Sg312
Copy link
Copy Markdown
Collaborator

@Sg312 Sg312 commented Apr 22, 2026

Summary

Brief description of what this PR does and why.

Fixes #(issue)

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

How has this been tested? What should reviewers focus on?

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)lpful -->

@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Apr 22, 2026 3:18am

Request Review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR introduces comprehensive end-to-end OTel tracing for the mothership copilot pipeline, linking Sim (Node.js) and Go spans into a single trace via W3C trace-context propagation. Key additions include a new otel.ts module with root/child span helpers, a custom business-span sampler, a MothershipOriginSpanProcessor for span renaming and tagging, fetchGo for instrumented outbound HTTP, and explicit OTel-context re-entry inside ReadableStream.start to survive the Next.js handler→stream ALS boundary.

Confidence Score: 5/5

Safe to merge — all findings are P2 suggestions with no functional impact on production tracing correctness.

The implementation is well-structured with careful attention to ALS boundary issues, span lifecycle correctness (no leaked spans), and explicit context propagation. All three findings are P2: the serviceName override is a documentation/configurability concern for forks, the tracer name inconsistency is cosmetic, and the non-null assertions don't change runtime behavior.

apps/sim/instrumentation-node.ts — the serviceName: 'mothership' override and its interaction with the documented user-configurable telemetry.config.ts field.

Important Files Changed

Filename Overview
apps/sim/instrumentation-node.ts OTel bootstrap rewritten: introduces custom sampler, MothershipOriginSpanProcessor, and OTLP exporter; hardcodes serviceName: 'mothership' overriding the user-configurable telemetry.config.ts field.
apps/sim/lib/copilot/request/otel.ts New module providing all OTel root/child span helpers for copilot lifecycle including startCopilotOtelRoot, withCopilotSpan, withCopilotToolSpan, and span-status policy helpers.
apps/sim/lib/copilot/request/go/fetch.ts New fetchGo wrapper that instruments all Sim→Go HTTP calls in an OTel child span and propagates W3C traceparent headers.
apps/sim/lib/copilot/request/lifecycle/start.ts SSE stream lifecycle updated: re-enters root OTel context inside ReadableStream.start to survive the Next.js handler→stream ALS boundary.
apps/sim/app/api/copilot/chat/stream/route.ts Stream resume route creates an explicit root span parented via traceparent header echo; span is correctly ended in all early-return paths and stream finally block.
apps/sim/lib/copilot/request/tools/executor.ts Tool execution wrapped in withCopilotToolSpan via new executeToolAndReportInner inner function; adds attachment-shape extraction for image metrics.
apps/sim/lib/copilot/request/lifecycle/finalize.ts Finalization wrapped in its own OTel span with outcome and publisher state; uses a separate tracer name instead of the shared getCopilotTracer() helper.
apps/sim/lib/copilot/request/go/propagation.ts New module: W3C trace context injection and extraction via W3CTraceContextPropagator.
apps/sim/lib/copilot/request/session/abort-reason.ts New zero-dependency module defining AbortReason vocab and isExplicitStopReason predicate to avoid circular imports between OTel and abort layers.
apps/sim/lib/copilot/request/tools/files.ts File output write wrapped in withCopilotSpan; non-null assertions added for context.workspaceId! and context.userId! which are typed as optional.

Sequence Diagram

sequenceDiagram
    participant Browser
    participant Next as Next.js (sim)
    participant OTel as OTel SDK
    participant Go as Go Orchestrator
    participant Tools as Tool Executor

    Browser->>Next: POST /api/copilot/chat
    Next->>OTel: startCopilotOtelRoot() → gen_ai.agent.execute (ROOT_CONTEXT)
    Note over Next,OTel: requestId = traceId

    Next->>Go: fetchGo() + W3C traceparent injected
    Go->>Go: child spans under Sim trace
    Go-->>Next: SSE stream

    loop Stream events
        Next->>Next: withCopilotSpan(copilot.*)
        Next->>Tools: executeToolAndReport()
        Tools->>OTel: withCopilotToolSpan() → tool.execute span
        Tools-->>Go: tool result
    end

    Next->>OTel: finalizeStream() → copilot.finalize.stream
    Next->>OTel: otelRoot.finish() → ends gen_ai.agent.execute

    Browser->>Next: GET /api/copilot/chat/stream (reconnect)
    Next->>OTel: startSpan(copilot.resume.request) parented via traceparent echo
    Next->>Next: ReadableStream SSE poll loop
    Next->>OTel: rootSpan.end() in finally block
Loading

Reviews (1): Last reviewed commit: "Fix lint" | Re-trigger Greptile

Comment on lines +149 to +150
serviceName: 'mothership',
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 serviceName override silently bypasses user-configurable field

telemetry.config.ts explicitly documents serviceName as configurable for forks of the repo ("You can change this for your fork"), but the code always overwrites it with 'mothership' on this line. Any fork that has customized the service name in telemetry.config.ts will silently get 'mothership' exported in their spans, breaking their own trace backends. The config file's documentation and the override should be reconciled — either drop the config field or conditionally apply this default.

import type { OrchestratorResult } from '@/lib/copilot/request/types'

const logger = createLogger('CopilotStreamFinalize')
const getTracer = () => trace.getTracer('sim-copilot-finalize', '1.0.0')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Inconsistent tracer name

getTracer() here returns trace.getTracer('sim-copilot-finalize', '1.0.0'), while every other mothership span uses getCopilotTracer() (tracer 'sim-ai-platform'). The tracer name is metadata and doesn't affect routing, but this creates an inconsistent otel.library.name dimension in the OTLP backend. The span name copilot.finalize.stream is still covered by ALLOWED_SPAN_PREFIXES so sampling is unaffected, but aligning with the shared helper would improve maintainability.

Comment on lines +207 to +208
[TraceAttr.CopilotOutputFileOutcome]: CopilotOutputFileOutcome.Uploaded,
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-null assertion on optional field

context.workspaceId is typed as string | undefined in ToolExecutionContext. The added ! suppresses the TypeScript error at compile time but provides no runtime guard — if workspaceId is actually undefined, uploadWorkspaceFile receives undefined silently. An explicit guard or a fallback default would be safer than a blanket assertion.

@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 22, 2026

PR Summary

Medium Risk
Touches multiple copilot/billing API routes and chat streaming/stop/abort flows to propagate W3C traceparent and add span/attribute logic; mistakes could affect request correlation or edge-case behavior in streaming termination and tool confirmations.

Overview
Adds end-to-end OTel tracing for mothership by introducing a canonical trace/attribute vocabulary (trace-attribute-values-v1) and wiring server routes to parent under Go spans via withIncomingGoSpan, while outbound Sim→Go calls now go through fetchGo to create child spans and propagate traceparent.

Updates chat lifecycle to return traceparent on chat POST and echo it on side-channel requests (stop/abort/confirm/stream resume), adds richer span outcomes/metrics across routes, and instruments async-run DB operations with per-operation spans (excluding high-frequency poll paths).

Fixes a few behavior edges while instrumenting: /copilot/confirm now returns 500 (not 400) when the durable update fails, /chat/stop can persist an optional requestId, stream resume replay stamps terminal synthetic events with the latest request id, and the client stop/finalize paths now dispatch queued follow-ups reliably and treat tool events as partial when status=generating.

Reworks Node OTel bootstrap to export only allowlisted spans, add mothership.origin tagging/name prefixing, support standard OTLP endpoint/headers env vars, and implement ratio-based sampling with improved export error visibility.

Reviewed by Cursor Bugbot for commit fad203d. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit fad203d. Configure here.

}
}
return 1.0
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default sampling ratio increased from 10% to 100%

Medium Severity

resolveSamplingRatio defaults to 1.0 (100%) when no env var is set, whereas the previous code hardcoded TraceIdRatioBasedSampler(0.1) (10%). This is a 10× increase in default trace export volume. Combined with the removal of ParentBasedSampler, business-root spans that were previously sampled at 10% are now always exported, and spans whose Go-side parent was unsampled are now independently re-sampled at 100% — potentially creating broken/orphan traces in backends. Deployments that relied on the old 10% default will see a significant cost and throughput increase without any configuration change.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit fad203d. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants