# Error Recovery

Failure modes, retry strategies, and recovery paths for each service boundary and operation type.

**4+1 View:** Process View

## Error Recovery Overview
```mermaid
flowchart TB
    subgraph Boundaries["Failure Boundaries"]
        B1["Frontend ↔ Backend\n(HTTP + SignalR)"]
        B2["Backend ↔ Processing Engine\n(HTTP with Polly)"]
        B3["Backend ↔ MAST Proxy\n(HTTP)"]
        B4["Backend ↔ MongoDB\n(MongoDB.Driver)"]
        B5["MAST Proxy ↔ STScI\n(External HTTP)"]
    end
    subgraph Recovery["Recovery Strategies"]
        R1["JWT refresh + retry"]
        R2["Polly: 3 retries + circuit breaker"]
        R3["Resumable downloads"]
        R4["Connection pool + auto-reconnect"]
        R5["User retry + resume"]
    end
    B1 --> R1
    B2 --> R2
    B3 --> R3
    B4 --> R4
    B5 --> R5
```
## Boundary 1: Frontend ↔ Backend

### Authentication Failures

| Trigger | Response | Recovery |
| --- | --- | --- |
| Access token expired | 401 Unauthorized | apiClient auto-refreshes the token via /api/auth/refresh, then retries the original request |
| Refresh token expired | 401 on refresh | Redirect to login page, clear local auth state |
| Invalid credentials | 401 on login | Show error, user retries |
### SignalR Disconnection

| Trigger | Response | Recovery |
| --- | --- | --- |
| Network interruption | WebSocket close | SignalR auto-reconnect (built-in, exponential backoff) |
| Token expired during connection | Auth failure | Reconnect with a fresh token |
| Server restart | Connection drop | Auto-reconnect + re-subscribe to active jobs (see the sketch below) |
| Missed updates during disconnect | Stale UI | JobSnapshot sent on re-subscribe catches up state |
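The reconnect-and-resubscribe behaviour can be sketched as follows. The web client uses the @microsoft/signalr TypeScript package; the sketch below uses the equivalent .NET SignalR client purely for illustration, and the hub URL and the `SubscribeToJob` method name are assumptions, not the actual API.

```csharp
using Microsoft.AspNetCore.SignalR.Client;

// Sketch only: hub URL and hub method name are illustrative assumptions.
string? accessToken = "<current JWT>";          // refreshed elsewhere via /api/auth/refresh
string[] activeJobIds = { "job-123" };

var connection = new HubConnectionBuilder()
    .WithUrl("https://backend.example/hubs/jobs", options =>
    {
        // Invoked on every connect and reconnect, so an expired token is replaced with a fresh one.
        options.AccessTokenProvider = () => Task.FromResult(accessToken);
    })
    .WithAutomaticReconnect()                   // built-in backoff: 0s, 2s, 10s, 30s
    .Build();

connection.Reconnected += async _ =>
{
    // Re-subscribe to active jobs; the server replies with a JobSnapshot so any
    // updates missed while disconnected are caught up.
    foreach (var jobId in activeJobIds)
        await connection.InvokeAsync("SubscribeToJob", jobId);
};

await connection.StartAsync();
```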
### HTTP Request Failures

| Trigger | Response | Recovery |
| --- | --- | --- |
| Network timeout | Fetch error | UI shows an error toast, user retries |
| 429 Rate Limited | Retry-After header | UI backs off, can retry after the delay |
| 503 Service Unavailable | Error response | UI shows a "service busy" message |
## Boundary 2: Backend ↔ Processing Engine

### Polly Resilience Pipeline (Composite & Mosaic)

```mermaid
flowchart LR
    Request["HTTP Request"] --> Retry["Retry Policy\n3 attempts\n2s exponential backoff"]
    Retry --> CB["Circuit Breaker\n61 min sampling"]
    CB --> Timeout1["Attempt Timeout\n30 minutes"]
    Timeout1 --> Timeout2["Total Timeout\n60 minutes"]
    Timeout2 --> Response["Response"]
```
**Retry conditions:**

| Error | Retried? | Reason |
| --- | --- | --- |
| HttpRequestException | Yes | Transient network error |
| TimeoutRejectedException | Yes | Temporary overload |
| HTTP 502 Bad Gateway | Yes | Proxy/container restart |
| HTTP 503 Service Unavailable | Yes | Temporary unavailability |
| HTTP 504 Gateway Timeout | Yes | Slow response |
| HTTP 500 Internal Server Error | No | Application bug; retrying won't help |
| HTTP 400 Bad Request | No | Client error; fix the request |
| HTTP 413 Payload Too Large | No | Input too big; won't shrink on retry |
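As a rough illustration of how the pipeline above could be assembled with Polly v8, the sketch below wires the figures from the diagram and the retry conditions from the table into a `ResiliencePipeline<HttpResponseMessage>`. The circuit-breaker thresholds, the strategy ordering, and the endpoint URL are assumptions; only the attempt count, delays, timeouts, and retried status codes come from this page.

```csharp
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;
using Polly.Timeout;

// Strategies added first are outermost: the total timeout wraps retry,
// circuit breaker, and the per-attempt timeout.
ResiliencePipeline<HttpResponseMessage> pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddTimeout(TimeSpan.FromMinutes(60))                 // total timeout across all attempts
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(2),
        BackoffType = DelayBackoffType.Exponential,
        // Mirror the retry conditions table: transient exceptions and 502/503/504 only.
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(r => (int)r.StatusCode is 502 or 503 or 504)
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        SamplingDuration = TimeSpan.FromMinutes(61),
        FailureRatio = 0.5,                               // illustrative threshold
        MinimumThroughput = 4,                            // illustrative
        BreakDuration = TimeSpan.FromSeconds(30)          // illustrative
    })
    .AddTimeout(TimeSpan.FromMinutes(30))                 // per-attempt timeout
    .Build();

using var http = new HttpClient();
HttpResponseMessage response = await pipeline.ExecuteAsync(
    async ct => await http.PostAsync(
        "http://processing-engine/composite/generate-nchannel", new StringContent("{}"), ct),
    CancellationToken.None);
```

Placing the total timeout outermost and the per-attempt timeout innermost means a single slow attempt is cut off at 30 minutes while the whole operation, retries included, never exceeds 60 minutes.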
### Error Translation

The backend translates Processing Engine errors into user-friendly messages via ProcessingErrorMessages (sketched below):

- HttpRequestException + ServiceUnavailable → "Processing engine is temporarily unavailable"
- HttpRequestException + SocketException → "Processing engine not reachable"
- TaskCanceledException → "Processing timed out"
- KeyNotFoundException → original message preserved (e.g., "File not found")
- Default → "An unexpected error occurred"
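A minimal sketch of that mapping as a C# switch expression; the real ProcessingErrorMessages helper may be shaped differently, and only the rules listed above are taken from this page.

```csharp
using System.Net;
using System.Net.Sockets;

// Usage: translate an exception from the Processing Engine client into a user-facing message.
Console.WriteLine(ToUserMessage(new TaskCanceledException()));    // "Processing timed out"

static string ToUserMessage(Exception ex) => ex switch
{
    HttpRequestException { StatusCode: HttpStatusCode.ServiceUnavailable }
        => "Processing engine is temporarily unavailable",
    HttpRequestException { InnerException: SocketException }
        => "Processing engine not reachable",
    TaskCanceledException => "Processing timed out",
    KeyNotFoundException => ex.Message,                            // original message preserved, e.g. "File not found"
    _ => "An unexpected error occurred"
};
```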
### Job Failure Pattern

```mermaid
sequenceDiagram
    participant BG as BackgroundService
    participant PE as Processing Engine
    participant JT as JobTracker
    participant Hub as SignalR
    BG->>JT: UpdateProgress(10%, "generating")
    BG->>PE: POST /composite/generate-nchannel
    alt Success
        PE-->>BG: 200 + image bytes
        BG->>JT: CompleteBlobJobAsync()
        JT->>Hub: JobCompleted
    else Transient Error (502/503)
        PE-->>BG: Error
        Note over BG: Polly retries (up to 3x)
        PE-->>BG: 200 + image bytes
        BG->>JT: CompleteBlobJobAsync()
    else Permanent Error
        PE-->>BG: 500 / timeout
        BG->>JT: FailJobAsync("Processing engine error")
        JT->>Hub: JobFailed
    end
```
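The worker-side shape of that handling might look like the sketch below. UpdateProgress, CompleteBlobJobAsync, and FailJobAsync are named on this page; the IJobTracker interface, the signatures, and the delegate that calls the Processing Engine are assumptions.

```csharp
// Sketch of the failure handling in the sequence above (not the actual implementation).
public interface IJobTracker
{
    Task UpdateProgress(string jobId, int percent, string stage);
    Task CompleteBlobJobAsync(string jobId, byte[] imageBytes);   // pushes JobCompleted via SignalR
    Task FailJobAsync(string jobId, string userMessage);          // pushes JobFailed via SignalR
}

public sealed class CompositeWorker
{
    private readonly IJobTracker _jobs;
    private readonly Func<CancellationToken, Task<byte[]>> _generateComposite; // wrapped by the Polly pipeline

    public CompositeWorker(IJobTracker jobs, Func<CancellationToken, Task<byte[]>> generateComposite)
        => (_jobs, _generateComposite) = (jobs, generateComposite);

    public async Task RunAsync(string jobId, CancellationToken ct)
    {
        await _jobs.UpdateProgress(jobId, 10, "generating");
        try
        {
            // Transient 502/503/504 responses are retried inside the Polly pipeline;
            // only errors that exhaust retries (or are non-retryable) reach the catch block.
            byte[] image = await _generateComposite(ct);
            await _jobs.CompleteBlobJobAsync(jobId, image);
        }
        catch (Exception ex)
        {
            await _jobs.FailJobAsync(jobId, ex is TaskCanceledException
                ? "Processing timed out"
                : "Processing engine error");
        }
    }
}
```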
## Boundary 3: Backend ↔ MAST Proxy (Imports)

### Download Failure Recovery

| Failure | Detection | Recovery |
| --- | --- | --- |
| Network interruption mid-download | HttpRequestException | Job marked as resumable; user can restart |
| STScI server 5xx | HTTP status code | Job fails; user retries (no automatic retry for downloads) |
| Disk full | IOException | Job fails with a storage error; no partial records saved |
| Download timeout (5 min) | TaskCanceledException | Job fails; user can retry |
| User cancellation | CancelRequested flag | CancellationTokenSource cancels the HTTP request; job marked cancelled |
### Resumable Import Pattern

```mermaid
sequenceDiagram
    participant User
    participant BE as Backend
    participant MP as MAST Proxy
    participant STScI
    User->>BE: POST /api/mast/import
    BE->>MP: POST /mast/chunked-download
    MP->>STScI: GET (streaming)
    Note over MP,STScI: Network interruption
    MP-->>BE: Error (partial download)
    BE->>BE: Mark job as "failed" + "resumable"
    BE->>BE: Record: downloadedBytes, fileProgress
    User->>BE: GET /api/jobs (sees resumable job)
    User->>BE: POST /api/mast/import (retry)
    BE->>MP: POST /mast/chunked-download (with byte offset)
    MP->>STScI: GET (Range header for resume)
    STScI-->>MP: 206 Partial Content
    MP-->>BE: Complete download
    BE->>BE: Mark job completed
```
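The resume step itself boils down to a ranged HTTP request: ask for the file from the recorded byte offset and expect 206 Partial Content. The sketch below shows that pattern with HttpClient; the URL, the local path, and the choice of which hop issues the ranged request are placeholders, not the project's actual code.

```csharp
using System.Net;
using System.Net.Http.Headers;

// Sketch of a Range-based resume; URL and path are placeholders.
using var http = new HttpClient { Timeout = TimeSpan.FromMinutes(5) };

const string localPath = "/data/imports/partial-download.fits";
long downloadedBytes = new FileInfo(localPath).Length;               // offset recorded when the job failed

var request = new HttpRequestMessage(HttpMethod.Get, "https://example-mast-proxy/mast/chunked-download");
request.Headers.Range = new RangeHeaderValue(downloadedBytes, null); // "Range: bytes=<offset>-"

using HttpResponseMessage response =
    await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);

if (response.StatusCode == HttpStatusCode.PartialContent)
{
    // Append only the missing bytes to the partially downloaded file.
    await using var file = new FileStream(localPath, FileMode.Append, FileAccess.Write);
    await response.Content.CopyToAsync(file);
}
else
{
    // Server ignored the Range header (or the offset was stale): fall back to a full re-download.
    response.EnsureSuccessStatusCode();
    await using var file = new FileStream(localPath, FileMode.Create, FileAccess.Write);
    await response.Content.CopyToAsync(file);
}
```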
### Import Data Integrity

- Atomic record creation: FITS metadata is extracted and validated before the MongoDB insert
- No partial records: if an import fails mid-file, no JwstData document is created for that file
- Checksum verification: file checksums are computed on download for integrity validation
- Cleanup on failure: partially downloaded files are cleaned up, so storage and DB stay consistent (see the sketch below)
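A compact sketch of the validate-then-insert and cleanup-on-failure rules; the JwstData record shape, the FITS parsing step, and the method names are illustrative assumptions.

```csharp
using MongoDB.Driver;

// Illustrative record shape; the real JwstData document has more fields.
public sealed record JwstData(string ObsId, string FilePath, string Checksum);

public static class FitsImporter
{
    // Validate first, insert once, delete the partial file on any failure, so no
    // partial JwstData record can exist without its file (and vice versa).
    public static async Task ImportFileAsync(
        string filePath, IMongoCollection<JwstData> collection, CancellationToken ct)
    {
        try
        {
            JwstData record = ExtractAndValidate(filePath);                  // throws on invalid FITS metadata
            await collection.InsertOneAsync(record, cancellationToken: ct); // one atomic insert per file
        }
        catch
        {
            if (File.Exists(filePath)) File.Delete(filePath);               // cleanup on failure
            throw;                                                          // job is marked failed upstream
        }
    }

    private static JwstData ExtractAndValidate(string filePath) =>
        throw new NotImplementedException("FITS parsing and checksum verification elided in this sketch");
}
```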
## Boundary 4: Backend ↔ MongoDB

### Connection Resilience

- Connection pool: MongoDB.Driver manages the connection pool automatically
- Auto-reconnect: the driver handles transient disconnections transparently
- No custom retry: relies on driver-level retry (retryable writes are enabled by default in MongoDB 4.2+); see the sketch below
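A minimal sketch of the driver-side configuration this relies on; the connection string and pool size are illustrative, and retryable writes/reads are already the driver defaults, shown only to make the behaviour explicit.

```csharp
using MongoDB.Driver;

// Sketch: no custom retry code; pooling and retryable writes come from the driver.
var settings = MongoClientSettings.FromConnectionString("mongodb://mongo:27017");
settings.RetryWrites = true;             // driver retries transient write failures (MongoDB 4.2+)
settings.RetryReads = true;
settings.MaxConnectionPoolSize = 100;    // pooling is managed by the driver
var client = new MongoClient(settings);
```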
### Failure Scenarios

| Failure | Impact | Recovery |
| --- | --- | --- |
| MongoDB unreachable at startup | Backend won't start | Docker depends_on ensures MongoDB starts first; a health check would catch the failure |
| MongoDB crash during operation | Write failure | Operation fails; user retries; MongoDB restarts via Docker |
| Disk full (MongoDB) | Write failure | Docker container may crash-loop (exit 133); fix: docker image prune -f && docker builder prune -f |
### Dual-Write Consistency (JobTracker)

The JobTracker uses a write-through cache (sketched below):

1. Update ConcurrentDictionary (in-memory) ← immediate
2. Persist to MongoDB (async) ← may fail
3. Push via SignalR (async) ← may fail

- Memory is authoritative for active jobs (fast reads)
- MongoDB is authoritative for job recovery after restart
- If the MongoDB write fails: job progress remains visible in-memory; the failure is logged as a warning; the job can still complete
- On restart: the in-memory cache is lost; running jobs appear stale in MongoDB (manual cleanup or TTL expiry needed)
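A sketch of that ordering, with both downstream writes treated as best-effort; apart from the JobTracker name and the three-step order described above, the types, delegates, and method names are assumptions.

```csharp
using System.Collections.Concurrent;

public sealed record JobState(string JobId, int Percent, string Stage);

public sealed class JobTracker
{
    private readonly ConcurrentDictionary<string, JobState> _active = new();
    private readonly Func<JobState, Task> _persistToMongo;   // may fail
    private readonly Func<JobState, Task> _pushViaSignalR;   // may fail
    private readonly Action<string> _logWarning;

    public JobTracker(Func<JobState, Task> persistToMongo, Func<JobState, Task> pushViaSignalR, Action<string> logWarning)
        => (_persistToMongo, _pushViaSignalR, _logWarning) = (persistToMongo, pushViaSignalR, logWarning);

    public async Task UpdateProgressAsync(string jobId, int percent, string stage)
    {
        // 1. In-memory update is immediate and authoritative for reads while the job runs.
        var state = _active.AddOrUpdate(jobId,
            _ => new JobState(jobId, percent, stage),
            (_, _) => new JobState(jobId, percent, stage));

        // 2. Best-effort persistence: a failed write is logged as a warning, not treated as fatal.
        try { await _persistToMongo(state); }
        catch (Exception ex) { _logWarning($"MongoDB persist failed for {jobId}: {ex.Message}"); }

        // 3. Best-effort push so connected clients see progress; reconnecting clients
        //    catch up from the JobSnapshot instead.
        try { await _pushViaSignalR(state); }
        catch (Exception ex) { _logWarning($"SignalR push failed for {jobId}: {ex.Message}"); }
    }
}
```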
## Boundary 5: MAST Proxy ↔ STScI (External)

### MAST API Resilience

| Failure | Frequency | Handling |
| --- | --- | --- |
| MAST API timeout | Occasional | 5-minute timeout; fail and surface to the user |
| MAST API 5xx | Rare | Fail with "MAST service unavailable" |
| MAST API rate limit | Possible | Backend rate-limits MAST calls to 30/min to stay well under MAST limits (see the sketch below) |
| MAST API schema change | Very rare | Would cause parsing errors; manual code update needed |
| STScI maintenance window | Periodic | Discovery/import unavailable; local features unaffected |
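The 30-calls-per-minute ceiling could be enforced with System.Threading.RateLimiting, as in the sketch below; the queueing behaviour and the error surfaced to the caller are assumptions, only the 30/min figure comes from the table above.

```csharp
using System.Threading.RateLimiting;

// Sketch: at most 30 MAST calls per one-minute window.
var mastLimiter = new FixedWindowRateLimiter(new FixedWindowRateLimiterOptions
{
    PermitLimit = 30,                                   // 30 MAST calls...
    Window = TimeSpan.FromMinutes(1),                   // ...per minute
    QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
    QueueLimit = 10                                     // queue a few callers instead of failing immediately (illustrative)
});

using RateLimitLease lease = await mastLimiter.AcquireAsync(permitCount: 1);
if (!lease.IsAcquired)
{
    // Surface a 429-style "try again later" to the caller instead of hammering MAST.
    throw new InvalidOperationException("MAST request rate limit reached; try again shortly.");
}
// ...perform the MAST HTTP call while holding the lease...
```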
### Isolation Principle

MAST failures are isolated to MAST-dependent features:

| Feature | MAST Dependency | Impact of MAST Outage |
| --- | --- | --- |
| Discovery (featured targets) | None (curated list) | No impact |
| MAST search | Direct | Unavailable |
| Data import | Direct | Unavailable |
| Recipe suggestions | Indirect (needs MAST data) | Limited (works with already-imported data) |
| Compositing | None | No impact |
| Mosaic | None | No impact |
| Analysis | None | No impact |
| Data library | None | No impact |
## Container Restart Recovery

### Processing Engine Restart

| State | Detection | Recovery |
| --- | --- | --- |
| No active job | Health check passes after restart | No impact |
| Active HTTP request from backend | HttpRequestException on backend | Polly retries (up to 3x); may succeed once the container is back up |
| Active job in queue | Backend detects HTTP failure | Job marked failed; user can re-submit |
### Backend Restart

| State | Detection | Recovery |
| --- | --- | --- |
| In-memory job cache | Lost on restart | Stale jobs remain in MongoDB; TTL cleans up in 30 min (see the TTL index sketch below) |
| Bounded channel queues | Lost on restart | Queued jobs lost; users must re-submit |
| SignalR connections | Dropped | Clients auto-reconnect and re-subscribe |
| Active HTTP calls | Interrupted | Processing Engine may complete work that no one collects |
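If the 30-minute cleanup is done with a MongoDB TTL index, it could be created as sketched below; the collection name, field names, and status filter are assumptions, only the 30-minute figure comes from the table above.

```csharp
using MongoDB.Bson;
using MongoDB.Driver;

// Sketch: let MongoDB expire stale job documents ~30 minutes after their last update.
var client = new MongoClient("mongodb://mongo:27017");
var jobs = client.GetDatabase("app").GetCollection<BsonDocument>("jobs");

await jobs.Indexes.CreateOneAsync(new CreateIndexModel<BsonDocument>(
    Builders<BsonDocument>.IndexKeys.Ascending("updatedAt"),
    new CreateIndexOptions<BsonDocument>
    {
        ExpireAfter = TimeSpan.FromMinutes(30),
        // Partial TTL index: expire only jobs still marked as running, so completed
        // job history is kept (assumption about the status field).
        PartialFilterExpression = Builders<BsonDocument>.Filter.Eq("status", "running")
    }));
```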
### MongoDB Restart

| State | Detection | Recovery |
| --- | --- | --- |
| Clean restart | Brief unavailability | Backend operations fail during the restart; auto-recover afterwards |
| Crash (exit 133 = OOM) | Container restart loop | Likely a disk space issue; prune Docker images |
| Data corruption | Startup failure | Restore from Docker volume backup |
## Health Check Architecture

```mermaid
flowchart LR
    Docker["Docker\nHealth Checks"] -->|HTTP GET /health| PE["Processing Engine\n(10s interval, 3 retries)"]
    Docker -->|HTTP GET /health| MP["MAST Proxy\n(10s interval, 3 retries)"]
    BE["Backend\n/api/health"] -->|HTTP GET /health| PE
    BE -->|HTTP GET /health| MP
    BE -->|"Status: Degraded\n(not Unhealthy)"| Report["Health Report"]
```
- Processing Engine or MAST Proxy down → backend reports Degraded (not Unhealthy); a dependency check along these lines is sketched below
- Backend continues serving local operations (data library, auth)
- Docker restarts containers after 3 consecutive health check failures (about 30 seconds at the 10 s check interval)
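A dependency check that maps an unreachable Processing Engine to Degraded rather than Unhealthy might look like this; the endpoint URL, check name, and registration details are assumptions.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Sketch: probe the Processing Engine's /health endpoint and downgrade failures
// to Degraded so the backend keeps serving local features (data library, auth).
public sealed class ProcessingEngineHealthCheck : IHealthCheck
{
    private readonly HttpClient _http;
    public ProcessingEngineHealthCheck(HttpClient http) => _http = http;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            using var response = await _http.GetAsync("http://processing-engine/health", cancellationToken);
            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy("Processing engine reachable")
                : HealthCheckResult.Degraded($"Processing engine returned {(int)response.StatusCode}");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Degraded("Processing engine unreachable", ex);
        }
    }
}

// Registration in Program.cs (illustrative):
// builder.Services.AddHttpClient<ProcessingEngineHealthCheck>();
// builder.Services.AddHealthChecks().AddCheck<ProcessingEngineHealthCheck>("processing-engine");
```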
Back to Architecture Overview