How to manage S3 cloud object storage with the AWS CLI#
This guide shows you how to upload files from your hub to AWS S3 cloud object storage. It covers some basic AWS CLI commands for managing S3 objects within cloud object storage for your hub.
Who is this guide for?
Some community hubs running on AWS infrastructure have scratch and/or persistent S3 storage buckets already configured. This documentation is intended for users with a hub that has this feature enabled.
Warning
Transferring large amounts of data to the cloud can incur expensive storage costs. Please think carefully about your data requirements and use this feature responsibly. See What makes up cloud costs and how to control them for further guidance.
Basic AWS CLI commands in the Terminal#
In the Terminal, check that the AWS CLI commands are available in your image with
Note
We recommend using the Pangeo notebook image, which has the AWS CLI package already installed.
$ which aws
/srv/conda/envs/notebook/bin/aws
If this returns nothing, then you can temporarily install the package with
curl https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -o $HOME/.local/awscliv2.zip
unzip $HOME/.local/awscliv2.zip -d $HOME/.local
export PATH=$HOME/.local/aws/dist:$PATH
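You can then verify that the aws executable is found on your PATH; the version string below is illustrative:
$ aws --version
aws-cli/2.15.0 Python/3.11.6 Linux/5.10 exe/x86_64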
Note
The following examples are for managing objects in a scratch bucket using the $SCRATCH_BUCKET environment variable. For persistent buckets, this can be replaced with the $PERSISTENT_BUCKET environment variable. See Scratch versus Persistent Buckets.
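You can confirm the variable is set from a terminal on the hub; the bucket name below is illustrative:
$ echo $SCRATCH_BUCKET
s3://<bucket_name>/<username>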
List prefixes within an S3 bucket#
- Prefix
There is no concept of “folders” in flat cloud object storage; instead, every object is indexed with a key-value pair. Prefixes are strings of characters at the beginning of an object key name, used to organize objects in a similar way to folders.
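For example, in the illustrative object path below, <username>/ is the prefix used to organize the object:
s3://<bucket_name>/<username>/analysis/output.nc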
Storage buckets on a 2i2c hub are organized into prefixes named after a hub user’s username. To list the prefixes of users that have stored files in cloud object storage, use the command
$ aws s3 ls $SCRATCH_BUCKET
PRE <username1>/
PRE <username2>/
where the label PRE indicates that the listed item is a prefix and not an object.
Tip
See AWS Docs – Organizing objects using prefixes for more information.
List the contents of your prefix#
List the files stored under your own prefix with the command
aws s3 ls $SCRATCH_BUCKET/
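This returns output similar to the following, where the timestamps, sizes, and filenames are illustrative:
2024-07-04 17:01:54          4 <filename1>
2024-07-04 17:05:12       1024 <filename2>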
Note
Note the trailing slash / after $SCRATCH_BUCKET, compared to the command in List prefixes within an S3 bucket.
Copy files on the hub to and from a bucket#
Copy a file on the hub to your prefix in the scratch bucket with the command
$ aws s3 cp <filepath> $SCRATCH_BUCKET/
upload: ./<filepath> to s3://2i2c-aws-us-scratch-showcase/<username>/<filepath>
and copy a file from your prefix in the scratch bucket with the command
$ aws s3 cp $SCRATCH_BUCKET/<source_filepath> <target_filepath>
download: s3://2i2c-aws-us-scratch-showcase/<username>/<source_filepath> to ./<target_filepath>
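To copy an entire directory rather than a single file, you can add the --recursive flag; the directory names here are placeholders:
$ aws s3 cp <dirpath> $SCRATCH_BUCKET/<dirpath>/ --recursive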
Delete a file from a bucket#
Delete a file from your prefix in a bucket with the command
$ aws s3 rm $SCRATCH_BUCKET/<filepath>
delete: s3://2i2c-aws-us-scratch-researchdelight/<username>/<filepath>
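If you are unsure what a command will remove, the --dryrun flag displays the operations that would be performed without running them, for example:
$ aws s3 rm $SCRATCH_BUCKET/<filepath> --dryrun
(dryrun) delete: s3://<bucket_name>/<username>/<filepath>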
Tip
Consult the AWS Docs – Use high-level (s3) commands with the AWS CLI for a more detailed guide to the AWS CLI commands for managing S3 objects.
Note
As mentioned in Access permissions, hub users can access each other’s files in object storage. Be careful about which objects you delete.
Upload files to an S3 bucket from outside the hub#
This section outlines a workflow for transferring datasets to the AWS bucket from outside the hub, for example from your local machine or a remote server. It works by generating a temporary access token that is valid for up to 1 hour.
Tip
The following workflow assumes you are using a Unix-like operating system outside the hub.
Set up a new software environment on your local machine
mamba create --name aws_transfer aws-cli
Activate the environment
mamba activate aws_transfer
Generate a temporary access token from your 2i2c hub
Note
We recommend using the Pangeo notebook image, which has the AWS CLI package already installed.
aws sts assume-role-with-web-identity --role-arn $AWS_ROLE_ARN --role-session-name $JUPYTERHUB_CLIENT_ID --web-identity-token "$(cat $AWS_WEB_IDENTITY_TOKEN_FILE)" --duration-seconds 1000
Tip
This access token is valid for 1000 seconds, as set by the --duration-seconds flag. The maximum value is 3600 seconds (1 hour), but we recommend setting this to the minimum time needed to transfer your data for security reasons. Please see AWS Docs – aws sts assume-role-with-web-identity for further information.
Note the key-value pairs returned for AccessKeyId, SecretAccessKey, and SessionToken
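The command returns JSON similar to the following, where all values are placeholders and the response is truncated for brevity:
{
    "Credentials": {
        "AccessKeyId": "<AccessKeyId>",
        "SecretAccessKey": "<SecretAccessKey>",
        "SessionToken": "<SessionToken>",
        "Expiration": "2024-07-04T18:01:54+00:00"
    },
    ...
}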
Configure the ~/.aws/credentials file on your local machine with a new profile using the following commands
aws configure set aws_access_key_id <AccessKeyId> --profile <profile_name>
aws configure set aws_secret_access_key <SecretAccessKey> --profile <profile_name>
aws configure set aws_session_token <SessionToken> --profile <profile_name>
Tip
See the AWS Docs – aws configure set for more information.
Set the region in your ~/.aws/config file on your local machine using the following command
aws configure set region <data_center_location>
Tip
See the FAQs below for how to find the data center location of your hub.
Define the AWS_PROFILE environment variable on your local machine
export AWS_PROFILE=<profile_name>
Define the $SCRATCH_BUCKET environment variable
export SCRATCH_BUCKET=s3://<bucket_name>/<username>
Upload the data to the storage bucket
$ aws s3 cp <your-data> $SCRATCH_BUCKET/
upload: ./<your-data> to s3://<bucket_name>/<username>/<your-data>
Check the contents of your prefix
$ aws s3 ls $SCRATCH_BUCKET/
2024-07-04 17:01:54          4 <your-data>
Tip
Note the trailing slash / after $SCRATCH_BUCKET.
FAQs#
How do I know if our hub is running on AWS or not?
Check out our list of running hubs under the column provider to see which cloud provider your hub is running on.
Where is the location of the data center our hub is running on?
Check out our list of running hubs under the column data center location.
How do I determine if a scratch and/or persistent bucket is already available?
Check whether the environment variables for each bucket are set. See Scratch buckets and Persistent buckets.
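As a quick check, you can run the following in a terminal on your hub; the output shown is illustrative:
$ env | grep -E 'SCRATCH_BUCKET|PERSISTENT_BUCKET'
SCRATCH_BUCKET=s3://<bucket_name>/<username>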
If S3 buckets are supposed to be available but the environment variables for AWS credentials are not defined, what should I do?
If environment variables for the relevant AWS credentials for your hub are not defined, then you may encounter the following error
An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity.
Please contact your hub champion so that they can open a 2i2c support ticket with us to resolve this issue on your behalf.
If S3 buckets are not set up but I want them for my community, what should I do?
This feature is not enabled by default since there are extra cloud costs associated with providing S3 object storage. Please speak to your hub champion, who can then open a 2i2c support ticket with us to request this feature for your hub.
Will 2i2c create additional, new S3 buckets for our community?
Please contact your hub champion to liaise with 2i2c support to discuss this option.
If our hub is running on GCP or Azure and we have object storage, what are our options?
Check out our resources listed in the Cloud Object Storage user topic guide.