Cloud Object Storage

This section gives an overview of storing data in the cloud, along with links to how-to guides for specific tools that manage your cloud data.

Overview

Your hub lives in the cloud, so the preferred way to store data is object storage, such as Amazon S3 or Google Cloud Storage. Cloud object storage is essentially a key/value storage system: the keys are strings and the values are bytes of data. Data is read and written using HTTP calls.
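To make the key/value model concrete, here is a minimal sketch in Python that stands in for an object store with an in-memory dictionary. The `TinyObjectStore` class and its method names are hypothetical and only mimic the shape of the operations; a real store such as Amazon S3 performs each of them as an HTTP request.

```python
class TinyObjectStore:
    """Toy in-memory stand-in for a bucket: string keys map to bytes values."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, value: bytes) -> None:
        # Writing replaces the whole object; there is no partial in-place update.
        self._objects[key] = value

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str = "") -> list[str]:
        # "Folders" are only a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = TinyObjectStore()
store.put("alice/results/run1.csv", b"a,b\n1,2\n")
store.put("alice/results/run2.csv", b"a,b\n3,4\n")
store.put("bob/notes.txt", b"hello")

print(store.list("alice/"))
print(store.get("bob/notes.txt"))
```

The point of the sketch is that there is no real directory tree: a "folder" is just a shared key prefix, and every operation names a full key.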

The performance of object storage is very different from that of file storage. On one hand, each individual read or write has a high overhead (10-100 milliseconds) because it goes over the network. On the other hand, object storage “scales out” nearly infinitely: we can make hundreds, thousands, or even millions of concurrent read and write requests. This makes object storage well suited to distributed data analytics, but data analysis software must be adapted to take advantage of these properties.
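The trade-off between per-request latency and concurrency can be illustrated without touching a real bucket. The following sketch simulates the network overhead with `time.sleep`; the `fake_get` function and the 20 ms latency are assumptions chosen for illustration, not measurements of any particular provider.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_SECONDS = 0.02  # assumed per-request overhead, for illustration only


def fake_get(key: str) -> bytes:
    """Pretend to fetch one object; the sleep stands in for network latency."""
    time.sleep(LATENCY_SECONDS)
    return key.encode()


keys = [f"chunk-{i}" for i in range(20)]

# One request at a time: the latencies add up.
start = time.perf_counter()
sequential = [fake_get(k) for k in keys]
sequential_seconds = time.perf_counter() - start

# Many requests in flight at once: the latencies overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(keys)) as pool:
    concurrent = list(pool.map(fake_get, keys))
concurrent_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f} s, concurrent: {concurrent_seconds:.2f} s")
```

Both runs return the same data, but the concurrent run finishes in roughly the time of a single request. Software that issues many small reads one after another pays the latency every time, which is why it needs adapting.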

Scratch versus persistent buckets on a 2i2c hub

Bucket

A bucket is a container for objects.

Object

An object is a file and any metadata that describes that file.

Scratch buckets

Scratch buckets are designed for storage of temporary files, e.g. intermediate results.

Tip

Any data in a scratch bucket is deleted after 7 days.

Do not use scratch buckets to permanently store critical data.

To check the name of your scratch bucket, open a terminal in your hub and run:

$ echo $SCRATCH_BUCKET
s3://2i2c-aws-us-scratch-showcase/<username>
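The same environment variable can be read from Python, for example to build object paths under your own prefix. This is a minimal sketch using only the standard library; the helper `split_bucket_url`, the username `alice`, and the object name `intermediate/step1.parquet` are hypothetical.

```python
import os
from urllib.parse import urlsplit


def split_bucket_url(url: str) -> tuple[str, str]:
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parts = urlsplit(url)
    return parts.netloc, parts.path.lstrip("/")


# Fall back to a placeholder when the variable is not set (e.g. outside the hub).
# The username "alice" is hypothetical.
scratch = os.environ.get(
    "SCRATCH_BUCKET", "s3://2i2c-aws-us-scratch-showcase/alice"
)
bucket, prefix = split_bucket_url(scratch)

# Build the URL of a temporary object under your own prefix.
target = f"{scratch.rstrip('/')}/intermediate/step1.parquet"
print(bucket, prefix, target)
```

Deriving paths from the variable, rather than hard-coding the bucket name, keeps code portable between hubs.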

Persistent buckets

Persistent buckets are designed for data that is used throughout the lifetime of a project. Unlike scratch buckets, their contents are not purged after a set number of days.

To check the name of your persistent bucket, open a terminal in your hub and run:

$ echo $PERSISTENT_BUCKET
s3://2i2c-aws-us-persistent-showcase/<username>

Storage costs

See “What exactly do cloud providers charge us for?” in the 2i2c Infrastructure Guide for a detailed overview of cloud object storage costs.

Tip

Hub admins and hub users are responsible for deleting objects in $PERSISTENT_BUCKET once they are no longer needed, which keeps cloud billing costs down. Hub champions are responsible for managing overall storage costs and the objects stored in $PERSISTENT_BUCKET.
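Before deleting anything, it helps to see which prefixes hold the most data. The sketch below totals object sizes per top-level prefix from a listing of `(key, size)` pairs. The function name and the sample listing are made up for illustration; in practice the listing would come from whichever tool you use to browse the bucket.

```python
from collections import defaultdict


def bytes_per_top_level_prefix(listing):
    """Sum object sizes by the first path component of each key."""
    totals = defaultdict(int)
    for key, size in listing:
        top = key.split("/", 1)[0]
        totals[top] += size
    # Largest first, so cleanup candidates are at the top.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up listing of (key, size in bytes) for illustration.
listing = [
    ("alice/model/weights.bin", 4_000_000_000),
    ("alice/model/old-weights.bin", 3_500_000_000),
    ("bob/tables/2023.parquet", 250_000_000),
]

for prefix, total in bytes_per_top_level_prefix(listing):
    print(f"{prefix}: {total / 1e9:.2f} GB")
```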

Tip

Every file you download from the hub to another machine incurs a data egress cost, which can be substantial for large files. Consider carefully whether you need to download large datasets from the hub; where possible, post-process and compress files first to reduce their size. Hub champions are responsible for costs incurred from data egress.
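As a rough illustration of how much compression can reduce the number of bytes transferred, the following sketch compresses some made-up, highly repetitive CSV content with Python's standard `gzip` module. The sample data is an assumption; the actual saving depends entirely on your data, and formats that are already compressed gain little.

```python
import gzip

# Made-up, highly repetitive CSV content; real data will compress differently.
raw = ("station,temperature\n" + "A1,20.5\n" * 10_000).encode()

compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")

# Decompressing restores the original bytes exactly.
assert gzip.decompress(compressed) == raw
```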

Access permissions

A single, shared set of credentials is used to access the storage buckets on a hub.

Tip

Hub users can access each other’s objects stored in scratch or persistent bucket storage, and may accidentally modify or delete them.
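One way to lower the risk of accidents in your own scripts is to check every key against your own prefix before a destructive operation. The sketch below shows the idea; `is_under_prefix` and `checked_delete` are hypothetical helpers, and the `delete` argument stands in for whatever delete function your storage library provides. This is a client-side convention only, not access control: the shared credentials still permit the operation.

```python
def is_under_prefix(key: str, own_prefix: str) -> bool:
    """Return True when `key` sits inside `own_prefix`."""
    # Normalize to exactly one trailing slash so "alice" does not match "alice2/...".
    root = own_prefix.rstrip("/") + "/"
    return key.startswith(root)


def checked_delete(key: str, own_prefix: str, delete) -> None:
    """Call `delete(key)` only for keys inside the caller's own prefix."""
    if not is_under_prefix(key, own_prefix):
        raise PermissionError(f"refusing to delete {key!r}: outside {own_prefix!r}")
    delete(key)


# Record deletions in a list instead of touching real storage.
deleted = []
checked_delete("alice/tmp/a.bin", "alice", deleted.append)
print(deleted)
```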

It is possible to configure read-only access for objects stored in cloud storage on your hub, though this is not a standard feature of our hubs. Please consult 2i2c support to discuss enabling this feature.