
fix: improve pull resume reliability and progress bar display on retry #873

Merged
ericcurtin merged 1 commit into main from fix-pull-resume-and-progress-bars
Apr 20, 2026

Conversation


@ericcurtin ericcurtin commented Apr 20, 2026

  • Only accept HTTP 206 with a matching Content-Range start offset as a
    successful Range response in rangeTransport; a 200 response means the
    server ignored the Range header and is sending from byte 0, so
    appending it to the partial file would corrupt the blob. A misbehaving
    server returning 206 with a different range is also rejected.

  • Preserve .incomplete files on all read errors, not just context
    cancellation, so every kind of transient failure (network reset, stream
    error, etc.) can be resumed on the next attempt.

  • Stop rolling back fully-downloaded layer blobs on Write failure.
    Layer blobs are content-addressed and immutable; keeping them lets a
    subsequent pull skip already-completed layers entirely instead of
    re-downloading them. The manifest, config, and index rollback still
    runs to leave the store in a consistent non-indexed state.

  • Print a blank line to stdout before each retry so that orphaned
    progress bars from the failed attempt are visually separated from the
    new attempt's bars, preventing garbled terminal output.

  • Increase Pull and Push max retries from 3 to 4 and update
    TestPullMaxRetriesExhausted to match (5 total attempts, error message
    says "after 4 retries").
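The fail-closed Content-Range check described in the first bullet can be sketched with the standard library alone. This is an illustrative helper, not code from this PR; `rangeStartMatches` is a hypothetical name:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rangeStartMatches reports whether a Content-Range header of the form
// "bytes START-END/TOTAL" starts at exactly the offset we requested.
// It fails closed: any header it cannot parse counts as a mismatch,
// so the caller falls back to re-downloading from byte 0 rather than
// appending a misaligned stream to the partial file.
func rangeStartMatches(contentRange string, want int64) bool {
	rest, ok := strings.CutPrefix(contentRange, "bytes ")
	if !ok {
		return false
	}
	dash := strings.IndexByte(rest, '-')
	if dash < 0 {
		return false
	}
	start, err := strconv.ParseInt(rest[:dash], 10, 64)
	if err != nil {
		return false
	}
	return start == want
}

func main() {
	fmt.Println(rangeStartMatches("bytes 100-199/1000", 100)) // server honoured the range
	fmt.Println(rangeStartMatches("bytes 0-999/1000", 100))   // 206 but wrong offset: reject
	fmt.Println(rangeStartMatches("", 100))                   // header missing: reject
}
```

A misbehaving server that answers 206 but restarts from byte 0 fails this check, which matches the behaviour the first bullet describes.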


@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • In withRetries, printing a blank line to stdout on every retry may interfere with consumers that parse stdout; consider gating this behind a TTY/interactive check or restricting it to human-oriented output modes.
  • The _ = results assignment in Write looks like a workaround to silence unused-variable warnings; if results truly isn’t needed after the early checks, consider restructuring to avoid the no-op assignment (e.g., scoping or moving the earlier usage closer) to keep the intent clearer.
## Individual Comments

### Comment 1
<location path="pkg/distribution/oci/remote/remote.go" line_range="200-204" />
<code_context>
-			if rs := GetRangeSuccess(req.Context()); rs != nil {
-				rs.Add(digest, requestedOffset)
-			}
+	// If we requested a Range, record success only when the server honoured it
+	// with 206 Partial Content. A 200 response means the server ignored the Range
+	// header and is sending the full file from byte 0; appending that stream to
+	// the existing partial file would produce a corrupt blob.
+	if requestedOffset > 0 && resp.StatusCode == http.StatusPartialContent {
+		if rs := GetRangeSuccess(req.Context()); rs != nil {
+			rs.Add(digest, requestedOffset)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider validating the Content-Range header in addition to StatusPartialContent.

Relying only on 206 still trusts the server to honour the requested range. A misbehaving server could return 206 with a `Content-Range` that starts at a different offset (e.g., 0), and `WriteBlob` would still treat this as safe to append, risking blob corruption. Please also parse and validate `Content-Range` so that we only call `rs.Add` when the start offset matches `requestedOffset`.

Suggested implementation:

```golang
	// If we requested a Range, record success only when the server honoured it
	// with 206 Partial Content and a matching Content-Range. A 200 response means
	// the server ignored the Range header and is sending the full file from byte 0;
	// appending that stream to the existing partial file would produce a corrupt blob.
	if requestedOffset > 0 && resp.StatusCode == http.StatusPartialContent {
		// Content-Range: bytes START-END/TOTAL
		if cr := resp.Header.Get("Content-Range"); cr != "" {
			var unit string
			var start, end int64
			var total string // can be "*" or a number; we don't need to interpret it here

			// Fail closed: only treat this as a successful range when we can parse it
			// and the start offset matches requestedOffset.
			if _, err := fmt.Sscanf(cr, "%s %d-%d/%s", &unit, &start, &end, &total); err == nil &&
				unit == "bytes" && start == requestedOffset {
				if rs := GetRangeSuccess(req.Context()); rs != nil {
					rs.Add(digest, requestedOffset)
				}
			}
		}
	}

	return resp, nil

```

1. Ensure `fmt` is imported in `pkg/distribution/oci/remote/remote.go`:
   - If it is not already imported, add `fmt` to the import block.
2. If the project has a shared helper for parsing `Content-Range` elsewhere, you may want to replace the inline `fmt.Sscanf` logic with that helper for consistency.
</issue_to_address>

### Comment 2
<location path="pkg/distribution/internal/store/store_test.go" line_range="438-439" />
<code_context>
 		layerPath := filepath.Join(storePath, "blobs", parts[0], parts[1])
-		if _, err := os.Stat(layerPath); !errors.Is(err, os.ErrNotExist) {
-			t.Fatalf("expected layer blob %q to be cleaned up, stat error: %v", layerPath, err)
+		if _, err := os.Stat(layerPath); err != nil {
+			t.Fatalf("expected layer blob %q to be retained for future resume, stat error: %v", layerPath, err)
 		}
 	}
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test case that verifies retained layer blobs are actually reused on a subsequent write/pull, not just that they exist on disk.

The updated assertion covers retention after rollback, but it doesn’t yet prove that retained blobs are reused. Please add or extend a test that performs a second `Write` of the same image after a failed first attempt, ensures the blob-fetch logic detects existing blobs (e.g. via `hasBlob`), and verifies that the second write does not recreate or re-download those blobs while leaving the final store state consistent. This will validate the end-to-end resume optimisation, not just persistence on disk.

Suggested implementation:

```golang
	// Layer blobs are content-addressed and are intentionally retained even
	// after a failed write. They may be reused by a subsequent pull of the
	// same or another model, and they allow the download to resume rather
	// than restart from byte 0. Only the manifest, config, and index are
	// rolled back to leave the store in a consistent (non-indexed) state.
	//
	// First, assert that all layer blobs still exist on disk after the failed
	// write and record their metadata so we can later prove they were not
	// re-written during a resumed write.
	layerInfos := make(map[string]os.FileInfo, len(diffIDs))
	for _, digestStr := range diffIDs {
		parts := strings.SplitN(digestStr, ":", 2)
		if len(parts) != 2 {
			t.Fatalf("unexpected diffID format: %q", digestStr)
		}
		layerPath := filepath.Join(storePath, "blobs", parts[0], parts[1])

		info, err := os.Stat(layerPath)
		if err != nil {
			t.Fatalf("expected layer blob %q to be retained for future resume, stat error: %v", layerPath, err)
		}

		layerInfos[layerPath] = info
	}

	// Perform a second write of the same content to simulate resuming the
	// interrupted pull. The store should detect the existing blobs and avoid
	// re-downloading or re-writing them, while leaving the store in a fully
	// indexed, consistent state on success.
	if err := store.Write(ctx, testRef, remoteDesc, fetcher); err != nil {
		t.Fatalf("second Write (resume) failed: %v", err)
	}

	// Verify that the resumed write re-used the retained layer blobs instead
	// of re-creating them by asserting size and mtime are unchanged.
	for layerPath, before := range layerInfos {
		after, err := os.Stat(layerPath)
		if err != nil {
			t.Fatalf("expected retained layer blob %q to still exist after resume, stat error: %v", layerPath, err)
		}

		if before.Size() != after.Size() {
			t.Fatalf("expected retained layer blob %q to be reused without size change (before=%d, after=%d)", layerPath, before.Size(), after.Size())
		}

		if !before.ModTime().Equal(after.ModTime()) {
			t.Fatalf("expected retained layer blob %q not to be rewritten on resume (mtime before=%v, after=%v)", layerPath, before.ModTime(), after.ModTime())
		}

```

The replacement above assumes the following names already exist in the surrounding test:
- `store` – the `Store` under test.
- `ctx` – the `context.Context` used for writes.
- `testRef` – the reference used for the original failed write (e.g. model/image ref).
- `remoteDesc` – the root descriptor passed into `Write` for the original failed write.
- `fetcher` – the `remotes.Fetcher` (or equivalent) used by `Write` to retrieve blobs.

To fully implement the “reuse” verification as described in your review:
1. Ensure this code is inside the same test that triggers the initial failed `Write`, *after* the failure has been asserted and `diffIDs` has been captured.
2. If the test currently uses different variable names for the second write (e.g. `desc` instead of `remoteDesc`, `ref` instead of `testRef`), adjust the `store.Write` call’s arguments accordingly.
3. If `store.Write` has a different signature (for example `Write(ctx, desc, fetcher)` or `Write(ctx, ref, desc, fetcher, opts...)`), update the call in the replacement block to match the actual signature and include any required options.
4. If the existing test uses a custom fetcher that can simulate a failure on the first write, reuse the same fetcher instance for the second `Write` so that the test validates the “hasBlob” shortcut path (i.e. fetcher should not have to re-upload or re-download layer blobs). Optionally, you can extend that fetcher to count how many times layer digests are fetched and assert that the second `Write` does not increment those counters for layer blobs.
</issue_to_address>




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the distribution store's efficiency and reliability by retaining layer blobs after failed operations to support resuming downloads and ensuring that only '206 Partial Content' responses validate range requests to prevent data corruption. Additionally, it improves the CLI retry experience by adding visual separation between attempts. Feedback was provided regarding a redundant blank identifier assignment in the store's write logic.

// re-download on the next attempt instead of resuming only the layer
// that was actually in progress. The manifest/config/index cleanup below
// is sufficient to leave the store in a consistent state.
_ = results // results used above for error checking; layer blobs are retained

critical

The statement _ = results is redundant and should be removed. In Go, the blank identifier is typically used to suppress unused variable errors; however, results is already utilized in the error-checking loop at line 368. Removing this line improves code clarity and adheres to the pragmatism principle by eliminating unnecessary code that does not contribute to the logic or maintainability of the system.

References
  1. Pragmatism — Does the solution match the complexity of the problem? Is the simplest viable approach being used? Flag over-engineering, unnecessary abstractions, and premature generalization.


@ilopezluna ilopezluna left a comment


I agree with the Sourcery and Gemini comments, but nothing here is blocking; we can add the suggestions in a follow-up PR.

@ericcurtin ericcurtin force-pushed the fix-pull-resume-and-progress-bars branch from 3a50841 to 5959463 on April 20, 2026 at 14:51
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
@ericcurtin ericcurtin force-pushed the fix-pull-resume-and-progress-bars branch from ae108b3 to 6e9c203 on April 20, 2026 at 16:14
@ericcurtin ericcurtin merged commit 4d9aca2 into main Apr 20, 2026
17 checks passed
@ericcurtin ericcurtin deleted the fix-pull-resume-and-progress-bars branch April 20, 2026 16:58