How-to manage GCP cloud object storage with Google Cloud SDK#
This guide shows you how to manage files in Google Cloud Storage using the Google Cloud SDK. The SDK is a set of libraries and tools for interacting with GCP. In this example, we cover some basic commands for managing objects within cloud object storage for your hub.
Who is this guide for?
Some community hubs running on GCP infrastructure have scratch and/or persistent storage buckets already configured. This documentation is intended for users with a hub that has this feature enabled.
Warning
Transferring large amounts of data to the cloud can incur expensive storage costs. Please think carefully about your data requirements and use this feature responsibly. See What makes up cloud costs and how to control them for further guidance.
Basic Google Cloud SDK commands in the Terminal#
In the Terminal, check that the Google Cloud SDK commands are available in your software environment with
$ which gcloud
/opt/conda/bin/gcloud
If this returns nothing, then you can temporarily install the package with
mamba install google-cloud-sdk
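Once installed, you can confirm that the CLI responds by checking its version
gcloud version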
Tip
If installing the package kills your server, then try using a server with more RAM.
Note
The following examples are for managing objects in a scratch bucket using the $SCRATCH_BUCKET
environment variable. For persistent buckets, this can be replaced with the $PERSISTENT_BUCKET
environment variable. See Scratch versus Persistent Buckets.
List prefixes within a GCP bucket#
- Prefix
There is no concept of “folders” in flat cloud object storage; every object is instead indexed with a key-value pair. Prefixes are strings of characters at the beginning of the object key name, used to organize objects in a way similar to folders.
Storage buckets on a 2i2c hub are organized into prefixes named after a hub user’s username. Check the name of your bucket by running the command
$ echo $SCRATCH_BUCKET
gs://<bucket_name>/<username>
Recursively list all the files in your bucket by running the command
gcloud storage ls --recursive $SCRATCH_BUCKET
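For illustration (the object names here are hypothetical), the recursive listing returns full object keys rather than folders, for example
gs://<bucket_name>/<username>/experiment-1/results.nc
gs://<bucket_name>/<username>/notes.txt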
Remember that cloud storage is flat and, as described in Access permissions, anyone can access each other’s files. You can therefore list the prefixes of the entire bucket with
$ gcloud storage ls gs://<bucket_name>
gs://<bucket_name>/<username1>/
gs://<bucket_name>/<username2>/
Tip
See Google Cloud Docs – List objects for more information.
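Given the cost warning above, it can also be useful to check how much data you are storing under your prefix. The following is a sketch using the SDK's disk-usage command; run gcloud storage du --help to confirm the flags available in your version
gcloud storage du --readable-sizes --summarize $SCRATCH_BUCKET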
Copy files on the hub to and from a bucket#
Copy a file on the hub to your prefix in the scratch bucket with the command
$ gcloud storage cp <filepath> $SCRATCH_BUCKET/<filepath>
Copying file://<filepath> to gs://<bucket_name>/<username>/<filepath>
Completed files 1/1 | 14.0B/14.0B
and copy a file from your prefix in the scratch bucket to the hub filestore with the command
$ gcloud storage cp $SCRATCH_BUCKET/<source_filepath> <target_filepath>
Copying gs://<bucket_name>/<username>/<source_filepath> to file://<target_filepath>
Completed files 1/1 | 14.0B/14.0B
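To copy a whole directory rather than a single file, gcloud storage cp also accepts a --recursive flag. A sketch with a hypothetical directory name:
gcloud storage cp --recursive <directory> $SCRATCH_BUCKET/<directory>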
Tip
See Google Cloud Docs – Copy, rename, and move objects for more information.
Delete a file from a bucket#
Delete a file from your prefix in a bucket with the command
$ gcloud storage rm $SCRATCH_BUCKET/<filepath>
Removing objects:
Removing gs://<bucket_name>/<username>/<filepath>
Completed 1/1
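Similarly, everything under a prefix can be deleted in one go with the --recursive flag. Use this with care, since deletions cannot be undone (the prefix below is a hypothetical example):
gcloud storage rm --recursive $SCRATCH_BUCKET/<prefix>/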
Tip
See Google Cloud Docs – Delete objects for more information.
Note
As mentioned in Access permissions, anyone can access each other’s files in object storage on the hub. Be careful about which objects you are deleting.
Upload files to a GCP bucket from outside the hub#
We outline workflows for two scenarios:
Small datasets from your local machine is suitable for data transfer from outside the hub that takes less than an hour
Large datasets from a remote server is suitable for data transfer from a shared resource such as a supercomputer
Tip
The following workflows assume you are using a Unix-like operating system outside the hub.
Small datasets from your local machine#
For small datasets that can be uploaded from your local machine, e.g. laptop or PC, you can generate a temporary access token on the hub to upload data to the GCP bucket. Keep this token safe and do not expose it publicly on a shared system.
Set up a new software environment on your local machine
mamba create --name gcp_transfer google-cloud-sdk
Activate the environment
mamba activate gcp_transfer
Generate a temporary access token from your 2i2c hub
gcloud auth print-access-token
Tip
This access token is valid for 60 minutes by default. This can be extended by up to 12 hours, but we recommend setting this to the minimum time needed to transfer your data for security reasons. Please see Google Cloud Docs – gcloud auth application-default print-access-token for further information.
Copy and paste the output of the above command to your local machine and save this to a token.txt file
Authorize the Google Cloud CLI
gcloud config set auth/access_token_file token.txt
Define the $SCRATCH_BUCKET environment variable on your local machine
SCRATCH_BUCKET=gs://<bucket_name>/<username>
Upload the data to the storage bucket
$ gcloud storage cp <your-data> $SCRATCH_BUCKET
Copying file://<your-data> to gs://<bucket_name>/<username>/<your-data>
Completed files 1/1 | 23.3MiB/23.3MiB
Average throughput: 8.9MiB/s
Check the contents of your prefix
$ gcloud storage ls $SCRATCH_BUCKET/
gs://<bucket_name>/<username>/<your-data>
Tip
Note the trailing slash / after $SCRATCH_BUCKET.
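For reference, once the token is saved the whole local-machine workflow condenses to a few commands. This is a sketch assuming the token.txt file from above and a placeholder data file:
# on your local machine
mamba activate gcp_transfer
gcloud config set auth/access_token_file token.txt
SCRATCH_BUCKET=gs://<bucket_name>/<username>
gcloud storage cp <your-data> $SCRATCH_BUCKET
gcloud storage ls $SCRATCH_BUCKET/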
Large datasets from a remote server#
Note
The following feature is only available on a case-by-case basis. This workflow is documented here for the sake of completeness.
For large datasets uploaded from a remote server, e.g. a supercomputer, you are authorized via membership of a Google Group controlled by your Hub Champion. Do not store access tokens, such as the one used in the method above, publicly on a shared system.
Request membership of the Google Group for access to bucket storage from your Hub Champion.
From the remote server, ensure that the google-cloud-sdk is available in your software environment (if you need help, seek guidance from the administrator of the remote server).
Set the Google account and the Google Cloud project ID that are used to authorize access
gcloud config set account <user@gmail.com>
gcloud config set project <project-id>
Note
See Google Cloud Docs – gcloud config set for further information.
Obtain user access credentials via a web flow with no browser
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/iam.test --no-browser
Note
It is important to include the --scopes= flag for security reasons. Do not run this command without it! See Google Cloud Docs – gcloud auth application-default login for further information.
Follow the instructions from the output. This will look like
You are authorizing client libraries without access to a web browser. Please run the following command on a machine with a web browser and copy its output back here. Make sure the installed gcloud version is 372.0.0 or newer.

gcloud auth application-default login --remote-bootstrap="https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=XXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXX.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.read_write+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fiam.test&state=XXXXXXXXXXXXXXXXXXXXX&access_type=offline&code_challenge=XXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXX&code_challenge_method=S256&token_usage=remote"

Enter the output of the above command:
After you have run the above command on a different machine with a web browser (e.g. your laptop or PC), you will be asked to authenticate yourself with a Google account via the web flow. Once you have completed this, return to the terminal to see an output such as
Copy the following line back to the gcloud CLI waiting to continue the login flow.
WARNING: The following line enables access to your Google Cloud resources. Only copy it to the trusted machine that you ran the `gcloud auth application-default login --no-browser` command on earlier.

https://localhost:8085/?state=XXXXXXXXXXXXXXXXXXXXXXXXXXX&code=4/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&scope=https://www.googleapis.com/auth/devstorage.read_write%20https://www.googleapis.com/auth/iam.test
Copy the URL from the output of the above command (starting https://...) and paste it into the Enter the output of the above command: prompt that remains displayed on the remote server. This will give an output like

Credentials saved to file: [/<remote-server-path>/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).
You should now be able to use the commands from How-to work with object storage in Python to manage files between the remote server and the storage bucket.
Note
When you are done, revoke your credentials with the command
gcloud auth application-default revoke
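For reference, the remote-server authorization sequence condenses to the sketch below (the account and project ID are placeholders; the actual data transfer happens with the Python client libraries linked above):
# on the remote server
gcloud config set account <user@gmail.com>
gcloud config set project <project-id>
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/iam.test --no-browser
# ... transfer data using the Python client libraries ...
gcloud auth application-default revoke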
FAQs#
Why should I use GCP cloud object storage versus traditional network storage?
Take a look at this overview from Google Cloud for a comparison between these storage options and their ideal use cases.
How much does storing data in cloud object storage cost?
Each community’s use case is different, so we cannot offer a blanket estimate on storage costs. Please see What makes up cloud costs and how to control them for further guidance.
Tip
Every file you download from the hub to another machine incurs data egress costs. Consider carefully whether you need to download large datasets from the hub, or alternatively post-process files on the hub where possible. Hub Champions are responsible for costs incurred from data egress.
How do I know if our hub is running on GCP or not?
Check out our list of running hubs to see which cloud provider your hub is running on.
How do I determine if a scratch and/or persistent bucket is already available?
Check whether the environment variables for each bucket are set. See Scratch buckets and Persistent buckets.
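For example, from the Terminal on the hub (unset variables simply print an empty line):
echo $SCRATCH_BUCKET
echo $PERSISTENT_BUCKET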
If storage buckets are not set up but I want them for my community, what should I do?
This feature is not enabled by default since there are extra cloud costs associated with providing object storage. Please speak to your Hub Champion, who can then open a 2i2c support ticket with us to request this feature for your hub.
Will 2i2c create additional, new storage buckets for our community?
Please contact your Hub Champion to liaise with 2i2c support to discuss this option.
If our hub is running on AWS or Azure and we have object storage, what are our options?
Check out our resources listed in the Cloud Object Storage user topic guide.
Acknowledgements#
Thank you to the LEAP-Pangeo community for authoring the original content that inspired this section.