fix: improve pull resume reliability and progress bar display on retry #873

ericcurtin merged 1 commit into main from
Conversation
Hey - I've found 2 issues, and left some high level feedback:
- In `withRetries`, printing a blank line to stdout on every retry may interfere with consumers that parse stdout; consider gating this behind a TTY/interactive check or restricting it to human-oriented output modes.
- The `_ = results` assignment in `Write` looks like a workaround to silence unused-variable warnings; if `results` truly isn’t needed after the early checks, consider restructuring to avoid the no-op assignment (e.g., scoping or moving the earlier usage closer) to keep the intent clearer.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `withRetries`, printing a blank line to stdout on every retry may interfere with consumers that parse stdout; consider gating this behind a TTY/interactive check or restricting it to human-oriented output modes.
- The `_ = results` assignment in `Write` looks like a workaround to silence unused-variable warnings; if `results` truly isn’t needed after the early checks, consider restructuring to avoid the no-op assignment (e.g., scoping or moving the earlier usage closer) to keep the intent clearer.
## Individual Comments
### Comment 1
<location path="pkg/distribution/oci/remote/remote.go" line_range="200-204" />
<code_context>
- if rs := GetRangeSuccess(req.Context()); rs != nil {
- rs.Add(digest, requestedOffset)
- }
+ // If we requested a Range, record success only when the server honoured it
+ // with 206 Partial Content. A 200 response means the server ignored the Range
+ // header and is sending the full file from byte 0; appending that stream to
+ // the existing partial file would produce a corrupt blob.
+ if requestedOffset > 0 && resp.StatusCode == http.StatusPartialContent {
+ if rs := GetRangeSuccess(req.Context()); rs != nil {
+ rs.Add(digest, requestedOffset)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider validating the Content-Range header in addition to StatusPartialContent.
Relying only on 206 still trusts the server to honour the requested range. A misbehaving server could return 206 with a `Content-Range` that starts at a different offset (e.g., 0), and `WriteBlob` would still treat this as safe to append, risking blob corruption. Please also parse and validate `Content-Range` so that we only call `rs.Add` when the start offset matches `requestedOffset`.
Suggested implementation:
```golang
// If we requested a Range, record success only when the server honoured it
// with 206 Partial Content and a matching Content-Range. A 200 response means
// the server ignored the Range header and is sending the full file from byte 0;
// appending that stream to the existing partial file would produce a corrupt blob.
if requestedOffset > 0 && resp.StatusCode == http.StatusPartialContent {
// Content-Range: bytes START-END/TOTAL
if cr := resp.Header.Get("Content-Range"); cr != "" {
var unit string
var start, end int64
var total string // can be "*" or a number; we don't need to interpret it here
// Fail closed: only treat this as a successful range when we can parse it
// and the start offset matches requestedOffset.
if _, err := fmt.Sscanf(cr, "%s %d-%d/%s", &unit, &start, &end, &total); err == nil &&
unit == "bytes" && start == requestedOffset {
if rs := GetRangeSuccess(req.Context()); rs != nil {
rs.Add(digest, requestedOffset)
}
}
}
}
return resp, nil
```
1. Ensure `fmt` is imported in `pkg/distribution/oci/remote/remote.go`:
- If it is not already imported, add `fmt` to the import block.
2. If the project has a shared helper for parsing `Content-Range` elsewhere, you may want to replace the inline `fmt.Sscanf` logic with that helper for consistency.
</issue_to_address>
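A stricter, fail-closed variant of the Content-Range check (helper name is illustrative) that avoids `Sscanf`'s lenient token scanning by parsing the `bytes START-END/TOTAL` form directly:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseContentRangeStart extracts START from a Content-Range header of
// the form "bytes START-END/TOTAL" (TOTAL may be "*"). It fails closed:
// any parse problem returns an error, so a response with a suspect
// header is never treated as a successful range.
func parseContentRangeStart(cr string) (int64, error) {
	const prefix = "bytes "
	if !strings.HasPrefix(cr, prefix) {
		return 0, fmt.Errorf("unsupported Content-Range unit: %q", cr)
	}
	spec := strings.TrimPrefix(cr, prefix)
	dash := strings.IndexByte(spec, '-')
	if dash <= 0 {
		return 0, fmt.Errorf("malformed Content-Range: %q", cr)
	}
	start, err := strconv.ParseInt(spec[:dash], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("malformed Content-Range start: %q", cr)
	}
	return start, nil
}

func main() {
	start, err := parseContentRangeStart("bytes 1024-2047/8192")
	fmt.Println(start, err) // 1024 <nil>
}
```

The caller would then invoke `rs.Add(digest, requestedOffset)` only when the parsed start equals `requestedOffset`.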
### Comment 2
<location path="pkg/distribution/internal/store/store_test.go" line_range="438-439" />
<code_context>
layerPath := filepath.Join(storePath, "blobs", parts[0], parts[1])
- if _, err := os.Stat(layerPath); !errors.Is(err, os.ErrNotExist) {
- t.Fatalf("expected layer blob %q to be cleaned up, stat error: %v", layerPath, err)
+ if _, err := os.Stat(layerPath); err != nil {
+ t.Fatalf("expected layer blob %q to be retained for future resume, stat error: %v", layerPath, err)
}
}
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test case that verifies retained layer blobs are actually reused on a subsequent write/pull, not just that they exist on disk.
The updated assertion covers retention after rollback, but it doesn’t yet prove that retained blobs are reused. Please add or extend a test that performs a second `Write` of the same image after a failed first attempt, ensures the blob-fetch logic detects existing blobs (e.g. via `hasBlob`), and verifies that the second write does not recreate or re-download those blobs while leaving the final store state consistent. This will validate the end-to-end resume optimisation, not just persistence on disk.
Suggested implementation:
```golang
// Layer blobs are content-addressed and are intentionally retained even
// after a failed write. They may be reused by a subsequent pull of the
// same or another model, and they allow the download to resume rather
// than restart from byte 0. Only the manifest, config, and index are
// rolled back to leave the store in a consistent (non-indexed) state.
//
// First, assert that all layer blobs still exist on disk after the failed
// write and record their metadata so we can later prove they were not
// re-written during a resumed write.
layerInfos := make(map[string]os.FileInfo, len(diffIDs))
for _, digestStr := range diffIDs {
parts := strings.SplitN(digestStr, ":", 2)
if len(parts) != 2 {
t.Fatalf("unexpected diffID format: %q", digestStr)
}
layerPath := filepath.Join(storePath, "blobs", parts[0], parts[1])
info, err := os.Stat(layerPath)
if err != nil {
t.Fatalf("expected layer blob %q to be retained for future resume, stat error: %v", layerPath, err)
}
layerInfos[layerPath] = info
}
// Perform a second write of the same content to simulate resuming the
// interrupted pull. The store should detect the existing blobs and avoid
// re-downloading or re-writing them, while leaving the store in a fully
// indexed, consistent state on success.
if err := store.Write(ctx, testRef, remoteDesc, fetcher); err != nil {
t.Fatalf("second Write (resume) failed: %v", err)
}
// Verify that the resumed write re-used the retained layer blobs instead
// of re-creating them by asserting size and mtime are unchanged.
for layerPath, before := range layerInfos {
after, err := os.Stat(layerPath)
if err != nil {
t.Fatalf("expected retained layer blob %q to still exist after resume, stat error: %v", layerPath, err)
}
if before.Size() != after.Size() {
t.Fatalf("expected retained layer blob %q to be reused without size change (before=%d, after=%d)", layerPath, before.Size(), after.Size())
}
if !before.ModTime().Equal(after.ModTime()) {
t.Fatalf("expected retained layer blob %q not to be rewritten on resume (mtime before=%v, after=%v)", layerPath, before.ModTime(), after.ModTime())
}
```
The replacement above assumes the following names already exist in the surrounding test:
- `store` – the `Store` under test.
- `ctx` – the `context.Context` used for writes.
- `testRef` – the reference used for the original failed write (e.g. model/image ref).
- `remoteDesc` – the root descriptor passed into `Write` for the original failed write.
- `fetcher` – the `remotes.Fetcher` (or equivalent) used by `Write` to retrieve blobs.
To fully implement the “reuse” verification as described in your review:
1. Ensure this code is inside the same test that triggers the initial failed `Write`, *after* the failure has been asserted and `diffIDs` has been captured.
2. If the test currently uses different variable names for the second write (e.g. `desc` instead of `remoteDesc`, `ref` instead of `testRef`), adjust the `store.Write` call’s arguments accordingly.
3. If `store.Write` has a different signature (for example `Write(ctx, desc, fetcher)` or `Write(ctx, ref, desc, fetcher, opts...)`), update the call in the replacement block to match the actual signature and include any required options.
4. If the existing test uses a custom fetcher that can simulate a failure on the first write, reuse the same fetcher instance for the second `Write` so that the test validates the “hasBlob” shortcut path (i.e. fetcher should not have to re-upload or re-download layer blobs). Optionally, you can extend that fetcher to count how many times layer digests are fetched and assert that the second `Write` does not increment those counters for layer blobs.
</issue_to_address>
Code Review
This pull request enhances the distribution store's efficiency and reliability by retaining layer blobs after failed operations to support resuming downloads and ensuring that only '206 Partial Content' responses validate range requests to prevent data corruption. Additionally, it improves the CLI retry experience by adding visual separation between attempts. Feedback was provided regarding a redundant blank identifier assignment in the store's write logic.
```go
// re-download on the next attempt instead of resuming only the layer
// that was actually in progress. The manifest/config/index cleanup below
// is sufficient to leave the store in a consistent state.
_ = results // results used above for error checking; layer blobs are retained
```
The statement `_ = results` is redundant and should be removed. In Go, the blank identifier is typically used to suppress unused-variable errors; however, `results` is already utilized in the error-checking loop at line 368. Removing this line improves code clarity and adheres to the pragmatism principle by eliminating unnecessary code that does not contribute to the logic or maintainability of the system.
References
- Pragmatism — Does the solution match the complexity of the problem? Is the simplest viable approach being used? Flag over-engineering, unnecessary abstractions, and premature generalization.
ilopezluna
left a comment
I agree with the Sourcery and Gemini comments, but nothing blocking; we can add the suggestions in a follow-up PR.
3a50841 to 5959463
- Only accept HTTP 206 with a matching Content-Range start offset as a successful Range response in rangeTransport; a 200 response means the server ignored the Range header and is sending from byte 0, so appending it to the partial file would corrupt the blob. A misbehaving server returning 206 with a different range is also rejected.
- Preserve .incomplete files on all read errors, not just context cancellation, so every kind of transient failure (network reset, stream error, etc.) can be resumed on the next attempt.
- Stop rolling back fully-downloaded layer blobs on Write failure. Layer blobs are content-addressed and immutable; keeping them lets a subsequent pull skip already-completed layers entirely instead of re-downloading them. The manifest, config, and index rollback still runs to leave the store in a consistent non-indexed state.
- Print a blank line to stdout before each retry so that orphaned progress bars from the failed attempt are visually separated from the new attempt's bars, preventing garbled terminal output.
- Increase Pull and Push max retries from 3 to 4 and update TestPullMaxRetriesExhausted to match (5 total attempts, error message says "after 4 retries").

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
ae108b3 to 6e9c203