How to manage S3 cloud object storage with the AWS CLI#

This guide shows you how to manage files in AWS S3 cloud object storage from your hub, covering basic AWS CLI commands for listing, copying, uploading and deleting S3 objects.

Who is this guide for?

Some community hubs running on AWS infrastructure have scratch and/or persistent S3 storage buckets already configured. This documentation is intended for users with a hub that has this feature enabled.

Warning

Transferring large amounts of data to the cloud can incur expensive storage costs. Please think carefully about your data requirements and use this feature responsibly. See What makes up cloud costs and how to control them for further guidance.

Basic AWS CLI commands in the Terminal#

In the Terminal, check that the AWS CLI commands are available in your image with

Note

We recommend using the Pangeo notebook image, which has the AWS CLI package already installed.

$ which aws
/srv/conda/envs/notebook/bin/aws

If this returns nothing, then you can temporarily install the package with

curl https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -o $HOME/.local/awscliv2.zip
unzip $HOME/.local/awscliv2.zip -d $HOME/.local
export PATH=$HOME/.local/aws/dist:$PATH
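Afterwards, you can confirm the installation worked by checking the CLI version (the exact version reported will depend on the release you downloaded)

aws --version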

Note

The following examples are for managing objects in a scratch bucket using the $SCRATCH_BUCKET environment variable. For persistent buckets, this can be replaced with the $PERSISTENT_BUCKET environment variable. See Scratch versus Persistent Buckets.

List prefixes within an S3 bucket#

Prefix

There is no concept of “folders” in flat cloud object storage; instead, every object is indexed with a key-value pair. Prefixes are strings of characters at the beginning of the object key name, used to organize objects in a way similar to folders.
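For example, with the illustrative object keys below, both objects share the <username>/ prefix and so appear to sit inside a <username> “folder”, even though the bucket itself is flat

<username>/file1.nc
<username>/plots/fig.png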

Storage buckets on a 2i2c hub are organized into prefixes named after a hub user’s username. To list the prefixes of users that have stored files in cloud object storage, use the command

$ aws s3 ls $SCRATCH_BUCKET
                           PRE <username1>/
                           PRE <username2>/

where the label PRE indicates the item listed is a prefix and not an object.

List the contents of your prefix#

List the files stored under your own prefix with the command

aws s3 ls $SCRATCH_BUCKET/

Note

Note the trailing slash / after $SCRATCH_BUCKET compared to the command specified in List prefixes within an S3 bucket.
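The output lists the objects stored under your prefix, for example (the file name, date and size shown are illustrative)

2024-07-04 17:01:54          4 <your-data>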

Copy files on the hub to and from a bucket#

Copy a file on the hub to your prefix in the scratch bucket with the command

$ aws s3 cp <filepath> $SCRATCH_BUCKET/
upload: ./<filepath> to s3://2i2c-aws-us-scratch-showcase/<username>/<filepath>

and copy a file from your prefix in the scratch bucket with the command

$ aws s3 cp $SCRATCH_BUCKET/<source_filepath> <target_filepath>
download: s3://2i2c-aws-us-scratch-showcase/<username>/<source_filepath> to ./<target_filepath>
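To copy an entire directory rather than a single file, the AWS CLI also accepts a --recursive flag. A minimal sketch, where my_data/ is an illustrative directory name

aws s3 cp my_data/ $SCRATCH_BUCKET/my_data/ --recursive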

Delete a file from a bucket#

Delete a file from your prefix in a bucket with the command

$ aws s3 rm $SCRATCH_BUCKET/<filepath>
delete: s3://2i2c-aws-us-scratch-researchdelight/<username>/<filepath>
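The same --recursive flag works with aws s3 rm and deletes every object under the given prefix, so use it with care. A sketch with an illustrative prefix

aws s3 rm $SCRATCH_BUCKET/<prefix>/ --recursive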

Tip

Consult the AWS Docs – Use high-level (s3) commands with the AWS CLI for a more detailed guide to AWS CLI commands for managing S3 objects.

Note

As mentioned in Access permissions, hub users can access each other's files in object storage. Be careful about which objects you delete.

Upload files to an S3 bucket from outside the hub#

We outline a workflow for transferring datasets to the AWS bucket from outside the hub, such as from your local machine or a remote server. This works by generating a temporary access token that is valid for up to 1 hour. A scripted variant of steps 3 to 5 is sketched after the list.

Tip

The following workflow assumes you are using a Unix-like operating system outside the hub.

  1. Set up a new software environment on your local machine

    mamba create --name aws_transfer aws-cli
    
  2. Activate the environment

    mamba activate aws_transfer
    
  3. Generate a temporary access token from your 2i2c hub

    Note

    Run this command from a terminal on your 2i2c hub, not your local machine. As above, we recommend the Pangeo notebook image, which has the AWS CLI package already installed.

    aws sts assume-role-with-web-identity --role-arn $AWS_ROLE_ARN --role-session-name $JUPYTERHUB_CLIENT_ID --web-identity-token "$(cat $AWS_WEB_IDENTITY_TOKEN_FILE)" --duration-seconds 1000 
    

    Tip

    Setting the --duration-seconds flag makes this access token valid for 1000 seconds. The maximum value is 3600 seconds (1 hour), but for security we recommend the minimum time needed to transfer your data. Please see AWS Docs – aws sts assume-role-with-web-identity for further information.

  4. Note the values returned for AccessKeyId, SecretAccessKey and SessionToken

  5. Configure the ~/.aws/credentials file on your local machine with a new profile using the following commands

    aws configure set aws_access_key_id <AccessKeyId> --profile <profile_name>
    aws configure set aws_secret_access_key <SecretAccessKey> --profile <profile_name>
    aws configure set aws_session_token <SessionToken> --profile <profile_name>
    

    Tip

    See the AWS Docs – aws configure set for more information.

  6. Set the region in your ~/.aws/config file on your local machine using the following command

    aws configure set region <data_center_location>
    

    Tip

    See the FAQs below for how to find the data center location of your hub.

  7. Define the AWS_PROFILE environment variable on your local machine

    export AWS_PROFILE=<profile_name>
    
  8. Define the $SCRATCH_BUCKET environment variable

    export SCRATCH_BUCKET=s3://<bucket_name>/<username>
    
  9. Upload the data to the storage bucket

    $ aws s3 cp <your-data> $SCRATCH_BUCKET
    upload: ./<your-data> to s3://<bucket_name>/<username>/<your-data>
    
  10. Check the contents of your prefix

    $ aws s3 ls $SCRATCH_BUCKET/
    2024-07-04 17:01:54          4 <your-data>
    

    Tip

    Note the trailing slash / after $SCRATCH_BUCKET.
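If you prefer to script steps 3 to 5, the sketch below pipes the STS response through jq to print ready-to-paste aws configure set commands. This is a minimal sketch, assuming jq is available in your hub image and using aws_transfer as an example profile name; run it from a terminal on the hub

aws sts assume-role-with-web-identity \
    --role-arn $AWS_ROLE_ARN \
    --role-session-name $JUPYTERHUB_CLIENT_ID \
    --web-identity-token "$(cat $AWS_WEB_IDENTITY_TOKEN_FILE)" \
    --duration-seconds 1000 \
    | jq -r '.Credentials
        | "aws configure set aws_access_key_id \(.AccessKeyId) --profile aws_transfer",
          "aws configure set aws_secret_access_key \(.SecretAccessKey) --profile aws_transfer",
          "aws configure set aws_session_token \(.SessionToken) --profile aws_transfer"'

Copy the three printed commands into a terminal on your local machine, then continue from step 6.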

FAQs#

  • How do I know if our hub is running on AWS or not?

    Check out our list of running hubs under the provider column to see which cloud provider your hub is running on.

  • What is the location of the data center our hub is running in?

    Check out our list of running hubs under the data center location column.

  • How do I determine if a scratch and/or persistent bucket is already available?

    Check whether the environment variables for each bucket are set. See Scratch buckets and Persistent buckets. A quick one-line check is also shown at the end of this page.

  • If S3 buckets are supposed to be available but the environment variables for AWS credentials are not defined, what should I do?

    If the relevant AWS credential environment variables for your hub are not defined, you may encounter the following error

    An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity.
    

    Please contact your hub champion so that they can open a 2i2c support ticket with us to resolve this issue on your behalf.

  • If S3 buckets are not set up but I want them for my community, what should I do?

    This feature is not enabled by default since there are extra cloud costs associated with providing S3 object storage. Please speak to your hub champion, who can then open a 2i2c support ticket with us to request this feature for your hub.

  • Will 2i2c create additional, new S3 buckets for our community?

    Please contact your hub champion to liaise with 2i2c support to discuss this option.

  • If our hub is running on GCP or Azure and we have object storage, what are our options?

    Check out our resources listed in the Cloud Object Storage user topic guide.
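As a quick check for the bucket-availability question above, you can print the bucket environment variables from a terminal on the hub; a configured bucket shows up as an s3:// URL

echo "$SCRATCH_BUCKET"
echo "$PERSISTENT_BUCKET"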