# Cloud Storage Evaluation for JWST Data Analysis

## Context

This document evaluates cloud storage options for the JWST Data Analysis platform, which currently uses local filesystem storage at `/app/data/mast/` with MongoDB for metadata. The goal is to identify the best storage backend for cloud deployment.

## Current Data Characteristics

| Property | Value |
|---|---|
| Primary format | FITS files (2D images, 3D cubes, multi-HDU) |
| Typical file size | 1-500 MB per FITS file |
| Access pattern | Write-once, read-many (download from MAST, then analyze) |
| Read style | Memory-mapped, HDU random access, byte-range reads |
| Write style | Chunked downloads (5 MB chunks), atomic rename on completion |
| Metadata store | MongoDB (separate from file storage) |
| Processing model | Sequential pipeline with optional intermediate results |

## JWST Data Lifecycle

```
Raw (_uncal) --> Calibrated (_rate, _cal) --> Combined (_i2d, _s2d, _x1d)
  [cold]            [warm]                       [hot]
```

Files move from frequently accessed during processing to rarely accessed once analysis is complete. This lifecycle maps naturally to tiered storage.


## Options Evaluated

### 1. Amazon S3

**Fit:** Strong

| Aspect | Assessment |
|---|---|
| Ecosystem alignment | STScI hosts JWST public data on S3 (`s3://stpubdata/`). Using S3 enables direct cloud-to-cloud access, potentially eliminating the MAST download pipeline entirely for public data. |
| Access pattern support | S3 byte-range GET requests support HDU-level random access. `astropy.io.fits` works with fsspec/s3fs for lazy access without full file downloads. |
| Tiered storage | S3 Intelligent-Tiering, Glacier, and lifecycle rules map directly to the raw-to-processed data lifecycle. |
| Cost (us-east-1) | ~$0.023/GB/month (Standard), ~$0.0125/GB/month (Infrequent Access), ~$0.004/GB/month (Glacier) |
| Python SDK | boto3 is mature and widely used. s3fs provides a filesystem-like interface. |
| .NET SDK | AWSSDK.S3 is well-supported, but not as natural a fit as Azure's SDKs for .NET. |

**Key advantage:** Direct access to STScI's public JWST data on S3 without HTTP download overhead.
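
The access-pattern claim above is easy to demonstrate. A minimal sketch, assuming astropy >= 5.2 (which added `use_fsspec` support to `fits.open`); the object key under `s3://stpubdata/` is illustrative, not a verified path:

```python
# Sketch: lazy HDU access over S3 byte ranges via astropy + fsspec/s3fs.
# The object key below is illustrative only, not a verified stpubdata path.
from astropy.io import fits

uri = "s3://stpubdata/jwst/public/jw02736/jw02736_nircam_i2d.fits"
with fits.open(uri, use_fsspec=True, fsspec_kwargs={"anon": True}) as hdul:
    header = hdul[1].header                  # fetched with a small ranged GET
    cutout = hdul[1].section[0:100, 0:100]   # only these bytes are downloaded
```

`.section` fetches just the requested region, so a small cutout from a 500 MB mosaic costs kilobytes of transfer rather than the full file.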

### 2. Azure Blob Storage

**Fit:** Strong

| Aspect | Assessment |
|---|---|
| Tiering | Hot/Cool/Cold/Archive tiers with automatic tiering policies. Good match for the data lifecycle. |
| .NET integration | First-class SDK (Azure.Storage.Blobs). Natural fit for the .NET backend. |
| Access pattern support | Supports byte-range reads. adlfs provides an fsspec-compatible interface for Python. |
| Cost (East US) | ~$0.018/GB/month (Hot), ~$0.01/GB/month (Cool), ~$0.002/GB/month (Archive) |
| Python SDK | azure-storage-blob, plus adlfs for fsspec integration. |
| Astronomy ecosystem | Less astronomy community adoption than S3. No direct MAST integration. |

**Key advantage:** If deploying to Azure, the .NET backend gets the most natural SDK experience and authentication story (Managed Identity).
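
To illustrate backend portability, a minimal sketch of the same fsspec pattern against Azure Blob via adlfs; the account and container names are hypothetical:

```python
# Sketch: identical fsspec code path, Azure Blob backend via adlfs.
# "jwstdata" (account) and "fits" (container) are hypothetical names.
import fsspec
from astropy.io import fits

fs = fsspec.filesystem("abfs", account_name="jwstdata", anon=False)
with fs.open("fits/data/mast/jw02736_nircam_i2d.fits", "rb") as f:
    hdul = fits.open(f)
```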

### 3. Google Cloud Storage

**Fit:** Moderate

| Aspect | Assessment |
|---|---|
| Storage classes | Standard/Nearline/Coldline/Archive with lifecycle management. |
| Analytics integration | Strong BigQuery integration if metadata analytics become important. |
| Cost | ~$0.020/GB/month (Standard), ~$0.010/GB/month (Nearline) |
| Python SDK | gcsfs for fsspec integration. |
| Astronomy ecosystem | Least adoption in the astronomy community. MAST's cloud-hosted data lives on S3, not GCS. |

**Key advantage:** Best option if the deployment target is GCP or if BigQuery-based metadata analytics are planned.

### 4. MinIO (S3-Compatible, Self-Hosted)

**Fit:** Strong for hybrid/on-prem

| Aspect | Assessment |
|---|---|
| API compatibility | Full S3 API; all S3 tooling (boto3, s3fs, fsspec) works without modification. |
| Deployment flexibility | Runs on-prem, in any cloud, or alongside the existing Docker Compose stack. |
| Vendor lock-in | None. Can migrate to/from any S3-compatible service. |
| Cost | Infrastructure cost only (no per-GB cloud charges). |
| Overhead | You manage the infrastructure, replication, and backups. |

**Key advantage:** Preserves the local-first, privacy-conscious philosophy while providing S3 API compatibility. Good stepping stone: develop against the S3 API locally, deploy to any S3-compatible service in production.
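
Because MinIO speaks the S3 API, the only code-level difference from AWS is the endpoint. A minimal sketch using s3fs against the MinIO development defaults shown in Phase 2 below:

```python
# Sketch: standard S3 tooling pointed at a local MinIO endpoint.
# Endpoint and credentials match the MinIO development defaults in Phase 2.
import fsspec

fs = fsspec.filesystem(
    "s3",
    key="minioadmin",
    secret="minioadmin",
    client_kwargs={"endpoint_url": "http://minio:9000"},
)
fs.put("local/file.fits", "jwst-analysis/data/mast/file.fits")
```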

### 5. Direct MAST Cloud Access (S3 Public Buckets)

**Fit:** Complementary

| Aspect | Assessment |
|---|---|
| Approach | Read JWST data directly from STScI's S3 buckets instead of downloading. |
| Implementation | astroquery supports cloud access via `enable_cloud_dataset()`. |
| Latency | Higher per-read latency than local disk, but eliminates the download wait entirely. |
| Cost | Free egress from same-region S3 (us-east-1). |
| Limitation | Read-only. Still need separate storage for processed outputs. |

**Key advantage:** Eliminates the download pipeline for public JWST data. Best used alongside one of the above options for storing processed results.


## Recommendation

### Primary: S3 or S3-Compatible Storage

S3 is the recommended storage backend for three reasons:

  1. Astronomy ecosystem convergence — STScI serves JWST data from S3. Using S3 enables direct cloud-to-cloud reads, potentially bypassing the chunked download system entirely for public observations.

  2. Access pattern alignment — The write-once, read-many pattern with large binary files is the exact use case object storage is designed for. Byte-range GET requests address the HDU random-access requirement.

  3. Tiered storage for data lifecycle — Raw FITS files (_uncal) are rarely accessed after calibration. Lifecycle policies can automatically transition them to cheaper tiers without application changes.

### Implementation Strategy

#### Phase 1: Abstract file access with fsspec

Add a storage abstraction using fsspec, which provides a unified filesystem interface across local, S3, Azure, and GCP backends. This avoids hard-coding any provider.

```
# requirements.txt additions
fsspec>=2024.0.0
s3fs>=2024.0.0        # for S3/MinIO
# adlfs>=2024.0.0     # for Azure (if needed)
```

```python
# Usage: the same code works for local and cloud backends
import fsspec
from astropy.io import fits

fs = fsspec.filesystem('s3', anon=False)  # or 'file' for local paths
with fs.open('bucket/path/to/file.fits', 'rb') as f:
    hdul = fits.open(f)
```

#### Phase 2: Configure storage backend via environment

```
# .env
STORAGE_BACKEND=s3              # or "local", "azure", "gcs"
STORAGE_BUCKET=jwst-analysis
STORAGE_PREFIX=data/mast
AWS_REGION=us-east-1

# For MinIO (local development)
STORAGE_BACKEND=s3
S3_ENDPOINT_URL=http://minio:9000
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
```
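
A minimal sketch of resolving these variables into an fsspec filesystem; `get_fs()` is a hypothetical helper, and `AZURE_STORAGE_ACCOUNT` is an assumed extra variable for the Azure case:

```python
# Sketch: map STORAGE_BACKEND onto an fsspec filesystem instance.
import os
import fsspec

def get_fs():
    backend = os.environ.get("STORAGE_BACKEND", "local")
    if backend == "local":
        return fsspec.filesystem("file")
    if backend == "s3":
        # S3_ENDPOINT_URL set -> MinIO or other S3-compatible; unset -> AWS
        endpoint = os.environ.get("S3_ENDPOINT_URL")
        client_kwargs = {"endpoint_url": endpoint} if endpoint else {}
        return fsspec.filesystem("s3", client_kwargs=client_kwargs)
    if backend == "azure":
        return fsspec.filesystem("abfs", account_name=os.environ["AZURE_STORAGE_ACCOUNT"])
    if backend == "gcs":
        return fsspec.filesystem("gcs")
    raise ValueError(f"Unknown STORAGE_BACKEND: {backend}")
```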

#### Phase 3: Enable direct MAST cloud access

```python
from astroquery.mast import Observations

Observations.enable_cloud_dataset()
# Downloads now pull from S3 instead of MAST HTTP when available
```
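
A fuller sketch of the query-to-retrieval flow; the proposal ID is illustrative:

```python
# Sketch: query MAST and retrieve products from S3 where available.
# proposal_id "2736" is illustrative; cloud_only skips non-cloud products.
from astroquery.mast import Observations

Observations.enable_cloud_dataset(provider="AWS")
obs = Observations.query_criteria(proposal_id="2736", obs_collection="JWST")
products = Observations.get_product_list(obs[0])
Observations.download_products(products, cloud_only=True)
```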

#### Phase 4: Add lifecycle policies

An S3 lifecycle configuration that transitions tagged raw data to cheaper tiers. Note that combining a prefix with a tag requires the `And` element, and each rule needs a `Status`:

```json
{
  "Rules": [
    {
      "ID": "archive-raw-data",
      "Status": "Enabled",
      "Filter": {
        "And": {
          "Prefix": "data/mast/",
          "Tags": [{ "Key": "processing_level", "Value": "raw" }]
        }
      },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```
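
This rule only matches objects carrying the `processing_level=raw` tag, so uploads of raw files must set that tag. A minimal sketch of applying the configuration with boto3, assuming the JSON above is saved as `lifecycle.json`:

```python
# Sketch: apply the lifecycle rules to the bucket with boto3.
# Assumes the JSON above is saved alongside this script as lifecycle.json.
import json
import boto3

with open("lifecycle.json") as f:
    config = json.load(f)

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="jwst-analysis",
    LifecycleConfiguration=config,
)
```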

### If deploying to Azure specifically

Replace S3 with Azure Blob Storage. The fsspec + adlfs approach keeps the Python code identical. The .NET backend benefits from the Azure.Storage.Blobs SDK and Managed Identity authentication.


## Cost Estimates (100 GB of FITS data)

| Provider | Hot | Warm | Cold | Egress (10 GB/month) |
|---|---|---|---|---|
| S3 | $2.30/mo | $1.25/mo | $0.40/mo | $0.90/mo |
| Azure Blob | $1.80/mo | $1.00/mo | $0.20/mo | $0.87/mo |
| GCS | $2.00/mo | $1.00/mo | $0.40/mo | $1.20/mo |
| MinIO | Infrastructure cost only | | | |

At typical research volumes (100 GB - 1 TB), cloud storage costs are negligible compared to compute costs for image processing.


## Decision Matrix

| Criterion | Weight | S3 | Azure Blob | GCS | MinIO |
|---|---|---|---|---|---|
| JWST ecosystem fit | High | 5 | 3 | 2 | 4 |
| fsspec/Python support | High | 5 | 4 | 4 | 5 |
| .NET SDK quality | Medium | 4 | 5 | 3 | 4 |
| Tiered storage | Medium | 5 | 5 | 4 | 2 |
| Vendor independence | Medium | 3 | 3 | 3 | 5 |
| Operational simplicity | Medium | 5 | 4 | 4 | 2 |
| Cost efficiency | Low | 4 | 4 | 4 | 5 |
| **Weighted total** | | **4.4** | **3.9** | **3.3** | **3.7** |

Scores: 1 (poor) to 5 (excellent).