# How to work with object storage in Python
> **Warning**
> Transferring large amounts of data to the cloud can incur significant storage costs. Please think carefully about your data requirements and use this feature responsibly. See What makes up cloud costs and how to control them for further guidance.
## Cloud-Native Formats
Cloud-native file formats are designed to work well with cloud object storage. These formats permit exploration of data and metadata without downloading the entire file / dataset and work well with distributed parallel computing. Here are some popular cloud-native formats and their use cases:
| Format | Use Case | Python Libraries |
|---|---|---|
| Parquet | Column-oriented data file format designed for efficient data storage and retrieval. Suitable for tabular-style data (rows and columns). | pandas, dask.dataframe, vaex, pyarrow |
| Zarr | Storage of large multidimensional arrays. | zarr, numpy, dask.array, xarray |
| Cloud-Optimized GeoTIFF (COG) | Geospatial raster data. | rasterio, rioxarray |
There are other more specialized cloud-optimized formats for specific scientific domains.
It is recommended to use cloud-native formats when working with big data in cloud object storage.
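For instance, a Parquet file can be read directly from object storage without downloading the whole file first. Below is a minimal sketch; the bucket and key are hypothetical placeholders, and anonymous access only works for publicly readable buckets:

```python
import pandas as pd

# A minimal sketch: read a Parquet file straight from object storage.
# "my-public-bucket" and the key are hypothetical placeholders.
df = pd.read_parquet(
    "s3://my-public-bucket/path/to/table.parquet",
    storage_options={"anon": True},  # anonymous access for public buckets
)
print(df.head())
```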
## Tools
From a user perspective, the main challenge of working with object storage is the need to use more specialized tools, rather than just simple files / filenames, to manage data. Fortunately, excellent tools exist to make working with object storage easy and familiar.
For Python users, the main tool is Filesystem Spec (`fsspec`), a set of packages which enable us to work with many different types of storage. Separate `fsspec` packages exist for each type of object storage:
- `s3fs` - for working with AWS S3 (Simple Storage Service) and compatible APIs. Most third-party object storage services (e.g. Wasabi and Open Storage Network) are compatible with S3.
- `gcsfs` - for working with Google Cloud Storage.
- `adlfs` - for working with Azure Data Lake and Azure Blob Storage.
Each system has its own mechanisms for authentication and authorization; see the documentation for each package for more details.
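Regardless of the backing store, the `fsspec` interface looks the same. As a quick illustration, here is a minimal sketch of listing a public S3 bucket anonymously (it uses the NASA MUR SST bucket referenced later in this guide, and requires `s3fs` to be installed):

```python
import fsspec

# Create an anonymous (unauthenticated) S3 filesystem object.
fs = fsspec.filesystem("s3", anon=True)

# List the top-level contents of the public MUR SST bucket,
# then drill into the Zarr store's directory listing.
print(fs.ls("mur-sst"))
print(fs.ls("mur-sst/zarr"))
```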
## Reading Data
When reading data from cloud object storage, you have two general options:
1. Download the data to the local filesystem. This is fine for small data, but not suitable for large data or cloud-optimized datasets. Downloads can be managed with Pooch or `fsspec` (see the sketch after this list).
2. Open the data with an application that understands how to stream data over HTTP directly from object storage. This is suitable for large data and cloud-native formats.
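As an example of the first option, here is a minimal sketch of downloading a single file with Pooch. The URL is a hypothetical placeholder:

```python
import pooch

# A minimal sketch: download a small file and cache it locally.
# The URL is a hypothetical placeholder. known_hash=None skips checksum
# verification, which is convenient for a sketch but not recommended for
# reproducible workflows.
local_path = pooch.retrieve(
    url="https://example.com/data/small-dataset.nc",
    known_hash=None,
)
print(local_path)  # path to the locally cached copy
```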
As an example of the second option, here is how you would open the NASA Multi-Scale Ultra High Resolution (MUR) Sea Surface Temperature (SST) dataset from the AWS Public Data program using Xarray:
```python
import xarray as xr

ds = xr.open_dataset("s3://mur-sst/zarr/", engine="zarr", storage_options={"anon": True})
```
## Writing Data
Writing data (and reading private data) requires credentials for authentication from outside the hub. 2i2c does not provide credentials to individual users; for information on getting started, see the instructions below.
The following code snippets show how to write data to a storage bucket with Python.
For AWS S3, generate a temporary access token following the instructions in Upload files to an S3 bucket from outside the hub and make a note of the profile name. Then create the filesystem object:
```python
import s3fs

fs = s3fs.S3FileSystem(profile="<profile_name>")
```
You can then manage files with the `fs` object.
Non-AWS S3 services (e.g. Wasabi Cloud) can be configured by passing an argument such as `client_kwargs={'endpoint_url': 'https://s3.us-east-2.wasabisys.com'}` to `S3FileSystem`.
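For example, continuing from the `fs` object created above, here is a minimal sketch of common operations; the bucket name is a hypothetical placeholder for a bucket your credentials can write to:

```python
# List objects in the bucket (hypothetical bucket name).
fs.ls("my-bucket")

# Upload a local file to the bucket.
fs.put("local-file.csv", "my-bucket/local-file.csv")

# Write a new object directly.
with fs.open("my-bucket/hello.txt", "w") as f:
    f.write("Hello from object storage!")
```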
For Google Cloud Storage, generate Application Default Credentials (ADC) following the instructions in Upload files to a GCP bucket from outside the hub and make a note of where the `application_default_credentials.json` file is located. Then create the filesystem object:
```python
import json

import gcsfs

with open('<path>/application_default_credentials.json') as token_file:
    token = json.load(token_file)

fs = gcsfs.GCSFileSystem(token=token)
```
You can then manage files with the `fs` object.
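As with S3, here is a minimal sketch of uploading files with this `fs` object, continuing from the snippet above; the bucket name is a hypothetical placeholder:

```python
# List objects in the bucket (hypothetical bucket name).
fs.ls("my-gcs-bucket")

# Upload a local file to the bucket.
fs.put("results.csv", "my-gcs-bucket/results.csv")
```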
## Example – Writing to a Scratch Bucket
Here is how you would write Xarray data to the scratch bucket in Zarr format.
```python
import os

import xarray as xr

SCRATCH_BUCKET = os.environ['SCRATCH_BUCKET']
ds = xr.tutorial.open_dataset("rasm")      # load example data
ds.to_zarr(f'{SCRATCH_BUCKET}/rasm.zarr')  # write data
```
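To check the result, here is a minimal sketch of reading the same Zarr store back, continuing from the snippet above and assuming the hub environment provides read access to the scratch bucket:

```python
# Open the store we just wrote, lazily, with the Zarr engine.
ds_back = xr.open_dataset(f"{SCRATCH_BUCKET}/rasm.zarr", engine="zarr")
print(ds_back)
```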
> **Tip**
> For more example workflows, check out the NASA Earthdata Cloud Cookbook.
## Data Catalogs
To make it easier to discover and share data in your project, it is recommended to use data catalogs. Intake is a popular tool for making data catalogs in Python.
Below is an example of an Intake data catalog for loading Zarr data into Xarray from the Open Storage Network. (This example is borrowed from the Ocean Eddy CPT project.)
```yaml
plugins:
  source:
    - module: intake_xarray
sources:
  neverworld_five_day_averages:
    description: Five-day-average fields from Neverworld2
    driver: zarr
    args:
      urlpath: s3://Pangeo/ocean-eddy-cpt/5-day-averages/
      consolidated: True
      storage_options:
        anon: True
        client_kwargs:
          endpoint_url: 'https://ncsa.osn.xsede.org'
  neverworld_quarter_degree_snapshots:
    description: snapshots of fields from Neverworld2
    driver: zarr
    args:
      urlpath: s3://Pangeo/ocean-eddy-cpt/quarter-degree/snapshots/
      consolidated: True
      storage_options:
        anon: True
        client_kwargs:
          endpoint_url: 'https://ncsa.osn.xsede.org'
```
To use this catalog, place it online and share the URL with your team.
Here is an example of how to use this catalog file:
```python
import intake

cat_url = "https://raw.githubusercontent.com/ocean-eddy-cpt/cpt-data/master/catalog.yaml"
cat = intake.open_catalog(cat_url)
list(cat)  # discover what is in the catalog

ds = cat['neverworld_five_day_averages'].to_dask()  # open lazily with Xarray
```