fix: improve pull resume reliability and progress bar display on retry #873

ericcurtin merged 1 commit into main from
Conversation
Hey - I've found 2 issues, and left some high level feedback:
- In `withRetries`, printing a blank line to stdout on every retry may interfere with consumers that parse stdout; consider gating this behind a TTY/interactive check or restricting it to human-oriented output modes.
- The `_ = results` assignment in `Write` looks like a workaround to silence unused-variable warnings; if `results` truly isn’t needed after the early checks, consider restructuring to avoid the no-op assignment (e.g., scoping or moving the earlier usage closer) to keep the intent clearer.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `withRetries`, printing a blank line to stdout on every retry may interfere with consumers that parse stdout; consider gating this behind a TTY/interactive check or restricting it to human-oriented output modes.
- The `_ = results` assignment in `Write` looks like a workaround to silence unused-variable warnings; if `results` truly isn’t needed after the early checks, consider restructuring to avoid the no-op assignment (e.g., scoping or moving the earlier usage closer) to keep the intent clearer.
## Individual Comments
### Comment 1
<location path="pkg/distribution/oci/remote/remote.go" line_range="200-204" />
<code_context>
- if rs := GetRangeSuccess(req.Context()); rs != nil {
- rs.Add(digest, requestedOffset)
- }
+ // If we requested a Range, record success only when the server honoured it
+ // with 206 Partial Content. A 200 response means the server ignored the Range
+ // header and is sending the full file from byte 0; appending that stream to
+ // the existing partial file would produce a corrupt blob.
+ if requestedOffset > 0 && resp.StatusCode == http.StatusPartialContent {
+ if rs := GetRangeSuccess(req.Context()); rs != nil {
+ rs.Add(digest, requestedOffset)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider validating the Content-Range header in addition to StatusPartialContent.
Relying only on 206 still trusts the server to honour the requested range. A misbehaving server could return 206 with a `Content-Range` that starts at a different offset (e.g., 0), and `WriteBlob` would still treat this as safe to append, risking blob corruption. Please also parse and validate `Content-Range` so that we only call `rs.Add` when the start offset matches `requestedOffset`.
Suggested implementation:
```golang
// If we requested a Range, record success only when the server honoured it
// with 206 Partial Content and a matching Content-Range. A 200 response means
// the server ignored the Range header and is sending the full file from byte 0;
// appending that stream to the existing partial file would produce a corrupt blob.
if requestedOffset > 0 && resp.StatusCode == http.StatusPartialContent {
// Content-Range: bytes START-END/TOTAL
if cr := resp.Header.Get("Content-Range"); cr != "" {
var unit string
var start, end int64
var total string // can be "*" or a number; we don't need to interpret it here
// Fail closed: only treat this as a successful range when we can parse it
// and the start offset matches requestedOffset.
if _, err := fmt.Sscanf(cr, "%s %d-%d/%s", &unit, &start, &end, &total); err == nil &&
unit == "bytes" && start == requestedOffset {
if rs := GetRangeSuccess(req.Context()); rs != nil {
rs.Add(digest, requestedOffset)
}
}
}
}
return resp, nil
```
1. Ensure `fmt` is imported in `pkg/distribution/oci/remote/remote.go`:
- If it is not already imported, add `fmt` to the import block.
2. If the project has a shared helper for parsing `Content-Range` elsewhere, you may want to replace the inline `fmt.Sscanf` logic with that helper for consistency.
</issue_to_address>
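A stricter, fail-closed variant of the Content-Range check (helper name is illustrative) that avoids `Sscanf`'s lenient token scanning by parsing the `bytes START-END/TOTAL` form directly:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseContentRangeStart extracts START from a Content-Range header of
// the form "bytes START-END/TOTAL" (TOTAL may be "*"). It fails closed:
// any parse problem returns an error, so a response with a suspect
// header is never treated as a successful range.
func parseContentRangeStart(cr string) (int64, error) {
	const prefix = "bytes "
	if !strings.HasPrefix(cr, prefix) {
		return 0, fmt.Errorf("unsupported Content-Range unit: %q", cr)
	}
	spec := strings.TrimPrefix(cr, prefix)
	dash := strings.IndexByte(spec, '-')
	if dash <= 0 {
		return 0, fmt.Errorf("malformed Content-Range: %q", cr)
	}
	start, err := strconv.ParseInt(spec[:dash], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("malformed Content-Range start: %q", cr)
	}
	return start, nil
}

func main() {
	start, err := parseContentRangeStart("bytes 1024-2047/8192")
	fmt.Println(start, err) // 1024 <nil>
}
```

The caller would then invoke `rs.Add(digest, requestedOffset)` only when the parsed start equals `requestedOffset`.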
### Comment 2
<location path="pkg/distribution/internal/store/store_test.go" line_range="438-439" />
<code_context>
layerPath := filepath.Join(storePath, "blobs", parts[0], parts[1])
- if _, err := os.Stat(layerPath); !errors.Is(err, os.ErrNotExist) {
- t.Fatalf("expected layer blob %q to be cleaned up, stat error: %v", layerPath, err)
+ if _, err := os.Stat(layerPath); err != nil {
+ t.Fatalf("expected layer blob %q to be retained for future resume, stat error: %v", layerPath, err)
}
}
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test case that verifies retained layer blobs are actually reused on a subsequent write/pull, not just that they exist on disk.
The updated assertion covers retention after rollback, but it doesn’t yet prove that retained blobs are reused. Please add or extend a test that performs a second `Write` of the same image after a failed first attempt, ensures the blob-fetch logic detects existing blobs (e.g. via `hasBlob`), and verifies that the second write does not recreate or re-download those blobs while leaving the final store state consistent. This will validate the end-to-end resume optimisation, not just persistence on disk.
Suggested implementation:
```golang
// Layer blobs are content-addressed and are intentionally retained even
// after a failed write. They may be reused by a subsequent pull of the
// same or another model, and they allow the download to resume rather
// than restart from byte 0. Only the manifest, config, and index are
// rolled back to leave the store in a consistent (non-indexed) state.
//
// First, assert that all layer blobs still exist on disk after the failed
// write and record their metadata so we can later prove they were not
// re-written during a resumed write.
layerInfos := make(map[string]os.FileInfo, len(diffIDs))
for _, digestStr := range diffIDs {
parts := strings.SplitN(digestStr, ":", 2)
if len(parts) != 2 {
t.Fatalf("unexpected diffID format: %q", digestStr)
}
layerPath := filepath.Join(storePath, "blobs", parts[0], parts[1])
info, err := os.Stat(layerPath)
if err != nil {
t.Fatalf("expected layer blob %q to be retained for future resume, stat error: %v", layerPath, err)
}
layerInfos[layerPath] = info
}
// Perform a second write of the same content to simulate resuming the
// interrupted pull. The store should detect the existing blobs and avoid
// re-downloading or re-writing them, while leaving the store in a fully
// indexed, consistent state on success.
if err := store.Write(ctx, testRef, remoteDesc, fetcher); err != nil {
t.Fatalf("second Write (resume) failed: %v", err)
}
// Verify that the resumed write re-used the retained layer blobs instead
// of re-creating them by asserting size and mtime are unchanged.
for layerPath, before := range layerInfos {
after, err := os.Stat(layerPath)
if err != nil {
t.Fatalf("expected retained layer blob %q to still exist after resume, stat error: %v", layerPath, err)
}
if before.Size() != after.Size() {
t.Fatalf("expected retained layer blob %q to be reused without size change (before=%d, after=%d)", layerPath, before.Size(), after.Size())
}
if !before.ModTime().Equal(after.ModTime()) {
t.Fatalf("expected retained layer blob %q not to be rewritten on resume (mtime before=%v, after=%v)", layerPath, before.ModTime(), after.ModTime())
}
```
The replacement above assumes the following names already exist in the surrounding test:
- `store` – the `Store` under test.
- `ctx` – the `context.Context` used for writes.
- `testRef` – the reference used for the original failed write (e.g. model/image ref).
- `remoteDesc` – the root descriptor passed into `Write` for the original failed write.
- `fetcher` – the `remotes.Fetcher` (or equivalent) used by `Write` to retrieve blobs.
To fully implement the “reuse” verification as described in your review:
1. Ensure this code is inside the same test that triggers the initial failed `Write`, *after* the failure has been asserted and `diffIDs` has been captured.
2. If the test currently uses different variable names for the second write (e.g. `desc` instead of `remoteDesc`, `ref` instead of `testRef`), adjust the `store.Write` call’s arguments accordingly.
3. If `store.Write` has a different signature (for example `Write(ctx, desc, fetcher)` or `Write(ctx, ref, desc, fetcher, opts...)`), update the call in the replacement block to match the actual signature and include any required options.
4. If the existing test uses a custom fetcher that can simulate a failure on the first write, reuse the same fetcher instance for the second `Write` so that the test validates the “hasBlob” shortcut path (i.e. fetcher should not have to re-upload or re-download layer blobs). Optionally, you can extend that fetcher to count how many times layer digests are fetched and assert that the second `Write` does not increment those counters for layer blobs.
</issue_to_address>
Code Review
This pull request enhances the distribution store's efficiency and reliability by retaining layer blobs after failed operations to support resuming downloads and ensuring that only '206 Partial Content' responses validate range requests to prevent data corruption. Additionally, it improves the CLI retry experience by adding visual separation between attempts. Feedback was provided regarding a redundant blank identifier assignment in the store's write logic.
```go
// re-download on the next attempt instead of resuming only the layer
// that was actually in progress. The manifest/config/index cleanup below
// is sufficient to leave the store in a consistent state.
_ = results // results used above for error checking; layer blobs are retained
```
The statement `_ = results` is redundant and should be removed. In Go, the blank identifier is typically used to suppress unused-variable errors; however, `results` is already utilized in the error-checking loop at line 368. Removing this line improves code clarity and adheres to the pragmatism principle by eliminating unnecessary code that does not contribute to the logic or maintainability of the system.
References
- Pragmatism — Does the solution match the complexity of the problem? Is the simplest viable approach being used? Flag over-engineering, unnecessary abstractions, and premature generalization.
ilopezluna
left a comment
I agree with the Sourcery and Gemini comments, but nothing blocking; we can add the suggestions in a follow-up PR.
3a50841 to 5959463
- Only accept HTTP 206 with a matching Content-Range start offset as a successful Range response in rangeTransport; a 200 response means the server ignored the Range header and is sending from byte 0, so appending it to the partial file would corrupt the blob. A misbehaving server returning 206 with a different range is also rejected.
- Preserve .incomplete files on all read errors, not just context cancellation, so every kind of transient failure (network reset, stream error, etc.) can be resumed on the next attempt.
- Stop rolling back fully-downloaded layer blobs on Write failure. Layer blobs are content-addressed and immutable; keeping them lets a subsequent pull skip already-completed layers entirely instead of re-downloading them. The manifest, config, and index rollback still runs to leave the store in a consistent non-indexed state.
- Print a blank line to stdout before each retry so that orphaned progress bars from the failed attempt are visually separated from the new attempt's bars, preventing garbled terminal output.
- Increase Pull and Push max retries from 3 to 4 and update TestPullMaxRetriesExhausted to match (5 total attempts, error message says "after 4 retries").

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
ae108b3 to 6e9c203