# Error Recovery

Failure modes, retry strategies, and recovery paths for each service boundary and operation type.

**4+1 View:** Process View

## Error Recovery Overview
```mermaid
flowchart TB
    subgraph Boundaries["Failure Boundaries"]
        B1["Frontend ↔ Backend\n(HTTP + SignalR)"]
        B2["Backend ↔ Processing Engine\n(HTTP with Polly)"]
        B3["Backend ↔ MAST Proxy\n(HTTP)"]
        B4["Backend ↔ MongoDB\n(MongoDB.Driver)"]
        B5["MAST Proxy ↔ STScI\n(External HTTP)"]
    end
    subgraph Recovery["Recovery Strategies"]
        R1["JWT refresh + retry"]
        R2["Polly: 3 retries + circuit breaker"]
        R3["Resumable downloads"]
        R4["Connection pool + auto-reconnect"]
        R5["User retry + resume"]
    end
    B1 --> R1
    B2 --> R2
    B3 --> R3
    B4 --> R4
    B5 --> R5
```
## Boundary 1: Frontend ↔ Backend

### Authentication Failures

| Trigger | Response | Recovery |
| --- | --- | --- |
| Access token expired | 401 Unauthorized | apiClient auto-refreshes the token via /api/auth/refresh, then retries the original request |
| Refresh token expired | 401 on refresh | Redirect to login page, clear local auth state |
| Invalid credentials | 401 on login | Show error, user retries |
### SignalR Disconnection

| Trigger | Response | Recovery |
| --- | --- | --- |
| Network interruption | WebSocket close | SignalR auto-reconnect (built-in, exponential backoff) |
| Token expired during connection | Auth failure | Reconnect with a fresh token |
| Server restart | Connection drop | Auto-reconnect + re-subscribe to active jobs (see the sketch below) |
| Missed updates during disconnect | Stale UI | JobSnapshot sent on re-subscribe catches up state |
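The reconnect-and-resubscribe behaviour can be sketched as follows. The web client uses the @microsoft/signalr TypeScript package; the sketch below uses the equivalent .NET SignalR client purely for illustration, and the hub URL and the `SubscribeToJob` method name are assumptions, not the actual API.

```csharp
using Microsoft.AspNetCore.SignalR.Client;

// Sketch only: hub URL and hub method name are illustrative assumptions.
string? accessToken = "<current JWT>";          // refreshed elsewhere via /api/auth/refresh
string[] activeJobIds = { "job-123" };

var connection = new HubConnectionBuilder()
    .WithUrl("https://backend.example/hubs/jobs", options =>
    {
        // Invoked on every connect and reconnect, so an expired token is replaced with a fresh one.
        options.AccessTokenProvider = () => Task.FromResult(accessToken);
    })
    .WithAutomaticReconnect()                   // built-in backoff: 0s, 2s, 10s, 30s
    .Build();

connection.Reconnected += async _ =>
{
    // Re-subscribe to active jobs; the server replies with a JobSnapshot so any
    // updates missed while disconnected are caught up.
    foreach (var jobId in activeJobIds)
        await connection.InvokeAsync("SubscribeToJob", jobId);
};

await connection.StartAsync();
```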
### HTTP Request Failures

| Trigger | Response | Recovery |
| --- | --- | --- |
| Network timeout | Fetch error | UI shows an error toast, user retries |
| 429 Rate Limited | Retry-After header | UI backs off, can retry after the delay |
| 503 Service Unavailable | Error response | UI shows a "service busy" message |
## Boundary 2: Backend ↔ Processing Engine

### Polly Resilience Pipeline (Composite & Mosaic)

```mermaid
flowchart LR
    Request["HTTP Request"] --> Retry["Retry Policy\n3 attempts\n2s exponential backoff"]
    Retry --> CB["Circuit Breaker\n61 min sampling"]
    CB --> Timeout1["Attempt Timeout\n30 minutes"]
    Timeout1 --> Timeout2["Total Timeout\n60 minutes"]
    Timeout2 --> Response["Response"]
```
**Retry conditions:**

| Error | Retried? | Reason |
| --- | --- | --- |
| HttpRequestException | Yes | Transient network error |
| TimeoutRejectedException | Yes | Temporary overload |
| HTTP 502 Bad Gateway | Yes | Proxy/container restart |
| HTTP 503 Service Unavailable | Yes | Temporary unavailability |
| HTTP 504 Gateway Timeout | Yes | Slow response |
| HTTP 500 Internal Server Error | No | Application bug; retrying won't help |
| HTTP 400 Bad Request | No | Client error; fix the request |
| HTTP 413 Payload Too Large | No | Input too big; won't shrink on retry |
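As a rough illustration of how the pipeline above could be assembled with Polly v8, the sketch below wires the figures from the diagram and the retry conditions from the table into a `ResiliencePipeline<HttpResponseMessage>`. The circuit-breaker thresholds, the strategy ordering, and the endpoint URL are assumptions; only the attempt count, delays, timeouts, and retried status codes come from this page.

```csharp
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;
using Polly.Timeout;

// Strategies added first are outermost: the total timeout wraps retry,
// circuit breaker, and the per-attempt timeout.
ResiliencePipeline<HttpResponseMessage> pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddTimeout(TimeSpan.FromMinutes(60))                 // total timeout across all attempts
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(2),
        BackoffType = DelayBackoffType.Exponential,
        // Mirror the retry conditions table: transient exceptions and 502/503/504 only.
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(r => (int)r.StatusCode is 502 or 503 or 504)
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        SamplingDuration = TimeSpan.FromMinutes(61),
        FailureRatio = 0.5,                               // illustrative threshold
        MinimumThroughput = 4,                            // illustrative
        BreakDuration = TimeSpan.FromSeconds(30)          // illustrative
    })
    .AddTimeout(TimeSpan.FromMinutes(30))                 // per-attempt timeout
    .Build();

using var http = new HttpClient();
HttpResponseMessage response = await pipeline.ExecuteAsync(
    async ct => await http.PostAsync(
        "http://processing-engine/composite/generate-nchannel", new StringContent("{}"), ct),
    CancellationToken.None);
```

Placing the total timeout outermost and the per-attempt timeout innermost means a single slow attempt is cut off at 30 minutes while the whole operation, retries included, never exceeds 60 minutes.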
### Error Translation

The backend translates Processing Engine errors into user-friendly messages via ProcessingErrorMessages (sketched below):

- HttpRequestException + ServiceUnavailable → "Processing engine is temporarily unavailable"
- HttpRequestException + SocketException → "Processing engine not reachable"
- TaskCanceledException → "Processing timed out"
- KeyNotFoundException → original message preserved (e.g., "File not found")
- Default → "An unexpected error occurred"
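A minimal sketch of that mapping as a C# switch expression; the real ProcessingErrorMessages helper may be shaped differently, and only the rules listed above are taken from this page.

```csharp
using System.Net;
using System.Net.Sockets;

// Usage: translate an exception from the Processing Engine client into a user-facing message.
Console.WriteLine(ToUserMessage(new TaskCanceledException()));    // "Processing timed out"

static string ToUserMessage(Exception ex) => ex switch
{
    HttpRequestException { StatusCode: HttpStatusCode.ServiceUnavailable }
        => "Processing engine is temporarily unavailable",
    HttpRequestException { InnerException: SocketException }
        => "Processing engine not reachable",
    TaskCanceledException => "Processing timed out",
    KeyNotFoundException => ex.Message,                            // original message preserved, e.g. "File not found"
    _ => "An unexpected error occurred"
};
```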
### Job Failure Pattern

```mermaid
sequenceDiagram
    participant BG as BackgroundService
    participant PE as Processing Engine
    participant JT as JobTracker
    participant Hub as SignalR
    BG->>JT: UpdateProgress(10%, "generating")
    BG->>PE: POST /composite/generate-nchannel
    alt Success
        PE-->>BG: 200 + image bytes
        BG->>JT: CompleteBlobJobAsync()
        JT->>Hub: JobCompleted
    else Transient Error (502/503)
        PE-->>BG: Error
        Note over BG: Polly retries (up to 3x)
        PE-->>BG: 200 + image bytes
        BG->>JT: CompleteBlobJobAsync()
    else Permanent Error
        PE-->>BG: 500 / timeout
        BG->>JT: FailJobAsync("Processing engine error")
        JT->>Hub: JobFailed
    end
```
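The worker-side shape of that handling might look like the sketch below. UpdateProgress, CompleteBlobJobAsync, and FailJobAsync are named on this page; the IJobTracker interface, the signatures, and the delegate that calls the Processing Engine are assumptions.

```csharp
// Sketch of the failure handling in the sequence above (not the actual implementation).
public interface IJobTracker
{
    Task UpdateProgress(string jobId, int percent, string stage);
    Task CompleteBlobJobAsync(string jobId, byte[] imageBytes);   // pushes JobCompleted via SignalR
    Task FailJobAsync(string jobId, string userMessage);          // pushes JobFailed via SignalR
}

public sealed class CompositeWorker
{
    private readonly IJobTracker _jobs;
    private readonly Func<CancellationToken, Task<byte[]>> _generateComposite; // wrapped by the Polly pipeline

    public CompositeWorker(IJobTracker jobs, Func<CancellationToken, Task<byte[]>> generateComposite)
        => (_jobs, _generateComposite) = (jobs, generateComposite);

    public async Task RunAsync(string jobId, CancellationToken ct)
    {
        await _jobs.UpdateProgress(jobId, 10, "generating");
        try
        {
            // Transient 502/503/504 responses are retried inside the Polly pipeline;
            // only errors that exhaust retries (or are non-retryable) reach the catch block.
            byte[] image = await _generateComposite(ct);
            await _jobs.CompleteBlobJobAsync(jobId, image);
        }
        catch (Exception ex)
        {
            await _jobs.FailJobAsync(jobId, ex is TaskCanceledException
                ? "Processing timed out"
                : "Processing engine error");
        }
    }
}
```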
## Boundary 3: Backend ↔ MAST Proxy (Imports)

### Download Failure Recovery

| Failure | Detection | Recovery |
| --- | --- | --- |
| Network interruption mid-download | HttpRequestException | Job marked as resumable; user can restart |
| STScI server 5xx | HTTP status code | Job fails; user retries (no automatic retry for downloads) |
| Disk full | IOException | Job fails with a storage error; no partial records saved |
| Download timeout (5 min) | TaskCanceledException | Job fails; user can retry |
| User cancellation | CancelRequested flag | CancellationTokenSource cancels the HTTP request; job marked cancelled |
### Resumable Import Pattern

```mermaid
sequenceDiagram
    participant User
    participant BE as Backend
    participant MP as MAST Proxy
    participant STScI
    User->>BE: POST /api/mast/import
    BE->>MP: POST /mast/chunked-download
    MP->>STScI: GET (streaming)
    Note over MP,STScI: Network interruption
    MP-->>BE: Error (partial download)
    BE->>BE: Mark job as "failed" + "resumable"
    BE->>BE: Record: downloadedBytes, fileProgress
    User->>BE: GET /api/jobs (sees resumable job)
    User->>BE: POST /api/mast/import (retry)
    BE->>MP: POST /mast/chunked-download (with byte offset)
    MP->>STScI: GET (Range header for resume)
    STScI-->>MP: 206 Partial Content
    MP-->>BE: Complete download
    BE->>BE: Mark job completed
```
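The resume step itself boils down to a ranged HTTP request: ask for the file from the recorded byte offset and expect 206 Partial Content. The sketch below shows that pattern with HttpClient; the URL, the local path, and the choice of which hop issues the ranged request are placeholders, not the project's actual code.

```csharp
using System.Net;
using System.Net.Http.Headers;

// Sketch of a Range-based resume; URL and path are placeholders.
using var http = new HttpClient { Timeout = TimeSpan.FromMinutes(5) };

const string localPath = "/data/imports/partial-download.fits";
long downloadedBytes = new FileInfo(localPath).Length;               // offset recorded when the job failed

var request = new HttpRequestMessage(HttpMethod.Get, "https://example-mast-proxy/mast/chunked-download");
request.Headers.Range = new RangeHeaderValue(downloadedBytes, null); // "Range: bytes=<offset>-"

using HttpResponseMessage response =
    await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);

if (response.StatusCode == HttpStatusCode.PartialContent)
{
    // Append only the missing bytes to the partially downloaded file.
    await using var file = new FileStream(localPath, FileMode.Append, FileAccess.Write);
    await response.Content.CopyToAsync(file);
}
else
{
    // Server ignored the Range header (or the offset was stale): fall back to a full re-download.
    response.EnsureSuccessStatusCode();
    await using var file = new FileStream(localPath, FileMode.Create, FileAccess.Write);
    await response.Content.CopyToAsync(file);
}
```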
### Import Data Integrity

- Atomic record creation: FITS metadata is extracted and validated before the MongoDB insert
- No partial records: if an import fails mid-file, no JwstData document is created for that file
- Checksum verification: file checksums are computed on download for integrity validation
- Cleanup on failure: partially downloaded files are cleaned up, so storage and DB stay consistent (see the sketch below)
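A compact sketch of the validate-then-insert and cleanup-on-failure rules; the JwstData record shape, the FITS parsing step, and the method names are illustrative assumptions.

```csharp
using MongoDB.Driver;

// Illustrative record shape; the real JwstData document has more fields.
public sealed record JwstData(string ObsId, string FilePath, string Checksum);

public static class FitsImporter
{
    // Validate first, insert once, delete the partial file on any failure, so no
    // partial JwstData record can exist without its file (and vice versa).
    public static async Task ImportFileAsync(
        string filePath, IMongoCollection<JwstData> collection, CancellationToken ct)
    {
        try
        {
            JwstData record = ExtractAndValidate(filePath);                  // throws on invalid FITS metadata
            await collection.InsertOneAsync(record, cancellationToken: ct); // one atomic insert per file
        }
        catch
        {
            if (File.Exists(filePath)) File.Delete(filePath);               // cleanup on failure
            throw;                                                          // job is marked failed upstream
        }
    }

    private static JwstData ExtractAndValidate(string filePath) =>
        throw new NotImplementedException("FITS parsing and checksum verification elided in this sketch");
}
```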
## Boundary 4: Backend ↔ MongoDB

### Connection Resilience

- Connection pool: MongoDB.Driver manages the connection pool automatically
- Auto-reconnect: the driver handles transient disconnections transparently
- No custom retry: relies on driver-level retry (retryable writes are enabled by default in MongoDB 4.2+); see the sketch below
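A minimal sketch of the driver-side configuration this relies on; the connection string and pool size are illustrative, and retryable writes/reads are already the driver defaults, shown only to make the behaviour explicit.

```csharp
using MongoDB.Driver;

// Sketch: no custom retry code; pooling and retryable writes come from the driver.
var settings = MongoClientSettings.FromConnectionString("mongodb://mongo:27017");
settings.RetryWrites = true;             // driver retries transient write failures (MongoDB 4.2+)
settings.RetryReads = true;
settings.MaxConnectionPoolSize = 100;    // pooling is managed by the driver
var client = new MongoClient(settings);
```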
### Failure Scenarios

| Failure | Impact | Recovery |
| --- | --- | --- |
| MongoDB unreachable at startup | Backend won't start | Docker depends_on ensures MongoDB starts first; a health check would catch the failure |
| MongoDB crash during operation | Write failure | Operation fails; user retries; MongoDB restarts via Docker |
| Disk full (MongoDB) | Write failure | Docker container may crash-loop (exit 133); fix: docker image prune -f && docker builder prune -f |
### Dual-Write Consistency (JobTracker)

The JobTracker uses a write-through cache (sketched below):

1. Update ConcurrentDictionary (in-memory) ← immediate
2. Persist to MongoDB (async) ← may fail
3. Push via SignalR (async) ← may fail

- Memory is authoritative for active jobs (fast reads)
- MongoDB is authoritative for job recovery after restart
- If the MongoDB write fails: job progress remains visible in-memory; the failure is logged as a warning; the job can still complete
- On restart: the in-memory cache is lost; running jobs appear stale in MongoDB (manual cleanup or TTL expiry needed)
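A sketch of that ordering, with both downstream writes treated as best-effort; apart from the JobTracker name and the three-step order described above, the types, delegates, and method names are assumptions.

```csharp
using System.Collections.Concurrent;

public sealed record JobState(string JobId, int Percent, string Stage);

public sealed class JobTracker
{
    private readonly ConcurrentDictionary<string, JobState> _active = new();
    private readonly Func<JobState, Task> _persistToMongo;   // may fail
    private readonly Func<JobState, Task> _pushViaSignalR;   // may fail
    private readonly Action<string> _logWarning;

    public JobTracker(Func<JobState, Task> persistToMongo, Func<JobState, Task> pushViaSignalR, Action<string> logWarning)
        => (_persistToMongo, _pushViaSignalR, _logWarning) = (persistToMongo, pushViaSignalR, logWarning);

    public async Task UpdateProgressAsync(string jobId, int percent, string stage)
    {
        // 1. In-memory update is immediate and authoritative for reads while the job runs.
        var state = _active.AddOrUpdate(jobId,
            _ => new JobState(jobId, percent, stage),
            (_, _) => new JobState(jobId, percent, stage));

        // 2. Best-effort persistence: a failed write is logged as a warning, not treated as fatal.
        try { await _persistToMongo(state); }
        catch (Exception ex) { _logWarning($"MongoDB persist failed for {jobId}: {ex.Message}"); }

        // 3. Best-effort push so connected clients see progress; reconnecting clients
        //    catch up from the JobSnapshot instead.
        try { await _pushViaSignalR(state); }
        catch (Exception ex) { _logWarning($"SignalR push failed for {jobId}: {ex.Message}"); }
    }
}
```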
## Boundary 5: MAST Proxy ↔ STScI (External)

### MAST API Resilience

| Failure | Frequency | Handling |
| --- | --- | --- |
| MAST API timeout | Occasional | 5-minute timeout; fail and surface to the user |
| MAST API 5xx | Rare | Fail with "MAST service unavailable" |
| MAST API rate limit | Possible | Backend rate-limits MAST calls to 30/min to stay well under MAST limits (see the sketch below) |
| MAST API schema change | Very rare | Would cause parsing errors; manual code update needed |
| STScI maintenance window | Periodic | Discovery/import unavailable; local features unaffected |
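The 30-calls-per-minute ceiling could be enforced with System.Threading.RateLimiting, as in the sketch below; the queueing behaviour and the error surfaced to the caller are assumptions, only the 30/min figure comes from the table above.

```csharp
using System.Threading.RateLimiting;

// Sketch: at most 30 MAST calls per one-minute window.
var mastLimiter = new FixedWindowRateLimiter(new FixedWindowRateLimiterOptions
{
    PermitLimit = 30,                                   // 30 MAST calls...
    Window = TimeSpan.FromMinutes(1),                   // ...per minute
    QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
    QueueLimit = 10                                     // queue a few callers instead of failing immediately (illustrative)
});

using RateLimitLease lease = await mastLimiter.AcquireAsync(permitCount: 1);
if (!lease.IsAcquired)
{
    // Surface a 429-style "try again later" to the caller instead of hammering MAST.
    throw new InvalidOperationException("MAST request rate limit reached; try again shortly.");
}
// ...perform the MAST HTTP call while holding the lease...
```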
### Isolation Principle

MAST failures are isolated to MAST-dependent features:

| Feature | MAST Dependency | Impact of MAST Outage |
| --- | --- | --- |
| Discovery (featured targets) | None (curated list) | No impact |
| MAST search | Direct | Unavailable |
| Data import | Direct | Unavailable |
| Recipe suggestions | Indirect (needs MAST data) | Limited (works with already-imported data) |
| Compositing | None | No impact |
| Mosaic | None | No impact |
| Analysis | None | No impact |
| Data library | None | No impact |
## Container Restart Recovery

### Processing Engine Restart

| State | Detection | Recovery |
| --- | --- | --- |
| No active job | Health check passes after restart | No impact |
| Active HTTP request from backend | HttpRequestException on backend | Polly retries (up to 3x); may succeed once the container is back up |
| Active job in queue | Backend detects HTTP failure | Job marked failed; user can re-submit |
### Backend Restart

| State | Detection | Recovery |
| --- | --- | --- |
| In-memory job cache | Lost on restart | Stale jobs remain in MongoDB; TTL cleans up in 30 min (see the TTL index sketch below) |
| Bounded channel queues | Lost on restart | Queued jobs lost; users must re-submit |
| SignalR connections | Dropped | Clients auto-reconnect and re-subscribe |
| Active HTTP calls | Interrupted | Processing Engine may complete work that no one collects |
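If the 30-minute cleanup is done with a MongoDB TTL index, it could be created as sketched below; the collection name, field names, and status filter are assumptions, only the 30-minute figure comes from the table above.

```csharp
using MongoDB.Bson;
using MongoDB.Driver;

// Sketch: let MongoDB expire stale job documents ~30 minutes after their last update.
var client = new MongoClient("mongodb://mongo:27017");
var jobs = client.GetDatabase("app").GetCollection<BsonDocument>("jobs");

await jobs.Indexes.CreateOneAsync(new CreateIndexModel<BsonDocument>(
    Builders<BsonDocument>.IndexKeys.Ascending("updatedAt"),
    new CreateIndexOptions<BsonDocument>
    {
        ExpireAfter = TimeSpan.FromMinutes(30),
        // Partial TTL index: expire only jobs still marked as running, so completed
        // job history is kept (assumption about the status field).
        PartialFilterExpression = Builders<BsonDocument>.Filter.Eq("status", "running")
    }));
```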
### MongoDB Restart

| State | Detection | Recovery |
| --- | --- | --- |
| Clean restart | Brief unavailability | Backend operations fail during the restart; auto-recover afterwards |
| Crash (exit 133 = OOM) | Container restart loop | Likely a disk space issue; prune Docker images |
| Data corruption | Startup failure | Restore from Docker volume backup |
## Health Check Architecture

```mermaid
flowchart LR
    Docker["Docker\nHealth Checks"] -->|HTTP GET /health| PE["Processing Engine\n(10s interval, 3 retries)"]
    Docker -->|HTTP GET /health| MP["MAST Proxy\n(10s interval, 3 retries)"]
    BE["Backend\n/api/health"] -->|HTTP GET /health| PE
    BE -->|HTTP GET /health| MP
    BE -->|"Status: Degraded\n(not Unhealthy)"| Report["Health Report"]
```
- Processing Engine or MAST Proxy down → backend reports Degraded (not Unhealthy); a dependency check along these lines is sketched below
- Backend continues serving local operations (data library, auth)
- Docker restarts containers after 3 consecutive health check failures (about 30 seconds at the 10 s check interval)
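A dependency check that maps an unreachable Processing Engine to Degraded rather than Unhealthy might look like this; the endpoint URL, check name, and registration details are assumptions.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Sketch: probe the Processing Engine's /health endpoint and downgrade failures
// to Degraded so the backend keeps serving local features (data library, auth).
public sealed class ProcessingEngineHealthCheck : IHealthCheck
{
    private readonly HttpClient _http;
    public ProcessingEngineHealthCheck(HttpClient http) => _http = http;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            using var response = await _http.GetAsync("http://processing-engine/health", cancellationToken);
            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy("Processing engine reachable")
                : HealthCheckResult.Degraded($"Processing engine returned {(int)response.StatusCode}");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Degraded("Processing engine unreachable", ex);
        }
    }
}

// Registration in Program.cs (illustrative):
// builder.Services.AddHttpClient<ProcessingEngineHealthCheck>();
// builder.Services.AddHealthChecks().AddCheck<ProcessingEngineHealthCheck>("processing-engine");
```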
Back to Architecture Overview