How-to manage GCP cloud object storage with Google Cloud SDK#

This guide shows you how to manage files in Google Cloud Storage using the Google Cloud SDK, a set of libraries and tools for interacting with GCP. In this example, we cover some basic commands for managing objects within cloud object storage for your hub.

Who is this guide for?

Some community hubs running on GCP infrastructure have scratch and/or persistent storage buckets already configured. This documentation is intended for users with a hub that has this feature enabled.

Warning

Transferring large amounts of data to the cloud can incur expensive storage costs. Please think carefully about your data requirements and use this feature responsibly. See What makes up cloud costs and how to control them for further guidance.

Basic Google Cloud SDK commands in the Terminal#

In the Terminal, check that the Google Cloud SDK commands are available in your software environment with

$ which gcloud
/opt/conda/bin/gcloud

If this returns nothing, then you can temporarily install the package with

mamba install google-cloud-sdk

Tip

If installing the package kills your server, then try using a server with more RAM.

Note

The following examples are for managing objects in a scratch bucket using the $SCRATCH_BUCKET environment variable. For persistent buckets, this can be replaced with the $PERSISTENT_BUCKET environment variable. See Scratch versus Persistent Buckets.
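
For example, to check the name of the persistent bucket and your prefix within it (assuming your hub has one configured), you can run

echo $PERSISTENT_BUCKET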

List prefixes within a GCP bucket#

Prefix

There is no concept of “folders” in flat cloud object storage; every object is instead indexed with a key-value pair. A prefix is a string of characters at the beginning of the object key name, used to organize objects in a similar way to folders.
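
For example, the following two (hypothetical) object keys share the prefix <username>/data/, so most tools display them as if they were files inside a data folder:

gs://<bucket_name>/<username>/data/file1.nc
gs://<bucket_name>/<username>/data/file2.nc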

Storage buckets on a 2i2c hub are organized into prefixes named after a hub user’s username. Check the name of your bucket by running the command

$ echo $SCRATCH_BUCKET
gs://<bucket_name>/<username>

Recursively list all the files in your bucket by running the command

gcloud storage ls --recursive $SCRATCH_BUCKET

Remember that cloud storage is flat and, as described in Access permissions, anyone can access each other’s files. You can therefore list the prefixes of the entire bucket with

$ gcloud storage ls gs://<bucket_name>
gs://<bucket_name>/<username1>/
gs://<bucket_name>/<username2>/
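
Given the warning about storage costs at the start of this guide, it can also be useful to check how much data you are storing. One way to do this is with the gcloud storage du command, e.g.

gcloud storage du --summarize --readable-sizes $SCRATCH_BUCKET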

Tip

See Google Cloud Docs – List objects for more information.

Copy files on the hub to and from a bucket#

Copy a file on the hub to your prefix in the scratch bucket with the command

$ gcloud storage cp <filepath> $SCRATCH_BUCKET/<filepath>
Copying file://<filepath> to gs://<bucket_name>/<username>/<filepath>
  Completed files 1/1 | 14.0B/14.0B

and copy a file from your prefix in the scratch bucket to the hub filestore with the command

$ gcloud storage cp $SCRATCH_BUCKET/<source_filepath> <target_filepath>
Copying gs://<bucket_name>/<username>/<source_filepath> to file://<target_filepath>
  Completed files 1/1 | 14.0B/14.0B
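
To copy a whole directory rather than a single file, pass the --recursive flag, e.g.

gcloud storage cp --recursive <directory> $SCRATCH_BUCKET/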

Delete a file from a bucket#

Delete a file from your prefix in a bucket with the command

$ gcloud storage rm $SCRATCH_BUCKET/<filepath>
Removing objects:
Removing gs://<bucket_name>/<username>/<filepath> 
  Completed 1/1                
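
Similarly, you can delete everything under a path in your prefix by passing the --recursive flag. Use this with care, since deleted objects cannot be recovered:

gcloud storage rm --recursive $SCRATCH_BUCKET/<directory>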

Tip

See Google Cloud Docs – Delete objects for more information.

Note

As mentioned in Access permissions, anyone can access each other’s files in object storage on the hub. Be careful about which objects you are deleting.

Upload files to a GCP bucket from outside the hub#

We outline workflows for two scenarios: uploading small datasets from your local machine and uploading large datasets from a remote server.

Tip

The following workflows assume you are using a Unix-like operating system outside the hub.

Small datasets from your local machine#

For small datasets that can be uploaded from your local machine, e.g. a laptop or PC, you can generate a temporary access token on the hub to upload data to the GCP bucket. Keep this token safe: do not expose it publicly or store it on a shared system.

  1. Set up a new software environment on your local machine

    mamba create --name gcp_transfer google-cloud-sdk
    
  2. Activate the environment

    mamba activate gcp_transfer
    
  3. Generate a temporary access token from your 2i2c hub

    gcloud auth print-access-token
    

    Tip

    This access token is valid for 60 minutes by default. The lifetime can be extended up to 12 hours, but for security reasons we recommend using the minimum time needed to transfer your data. Please see Google Cloud Docs – gcloud auth application-default print-access-token for further information.

  4. Copy the output of the above command to your local machine and save it to a file named token.txt

  5. Authorize the Google Cloud CLI

    gcloud config set auth/access_token_file token.txt
    
  6. Define the $SCRATCH_BUCKET environment variable on your local machine

    SCRATCH_BUCKET=gs://<bucket_name>/<username> 
    
  7. Upload the data to the storage bucket

    $ gcloud storage cp <your-data> $SCRATCH_BUCKET
    Copying file://<your-data> to gs://<bucket_name>/<username>/<your-data>
      Completed files 1/1 | 23.3MiB/23.3MiB                                                     
    
      Average throughput: 8.9MiB/s
    
  8. Check the contents of your prefix

    $ gcloud storage ls $SCRATCH_BUCKET/
    gs://<bucket_name>/<username>/<your-data>
    

    Tip

    Note the trailing slash / after $SCRATCH_BUCKET.
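
Once the transfer is complete, it is good practice to delete the token file from your local machine and unset the configuration so that the expired token is not left behind:

rm token.txt
gcloud config unset auth/access_token_file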

Large datasets from a remote server#

Note

The following feature is only available on a case-by-case basis. This workflow is documented here for the sake of completeness.

For large datasets uploaded from a remote server, e.g. a supercomputer, you are authorized via membership of a Google Group controlled by your Hub Champion. Do not store any access tokens, such as those used in the method above, on a shared system where they could be exposed.

  1. Request membership of the Google Group for access to bucket storage from your Hub Champion.

  2. From the remote server, ensure that the google-cloud-sdk is available in your software environment (if you need help, seek guidance from the administrator of the remote server).

  3. Set the Google account and the Google Cloud project ID that are used to authorize access

    gcloud config set account <user@gmail.com>
    
    gcloud config set project <project-id>
    

    Note

    See Google Cloud Docs - gcloud config set for further information.

  4. Obtain user access credentials via a web flow with no browser

    gcloud auth application-default login --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/iam.test --no-browser
    

    Note

    It is important to include the --scopes= flag for security reasons. Do not run this command without it! See Google Cloud Docs - gcloud auth application-default login for further information.

  5. Follow the instructions from the output. This will look like

    You are authorizing client libraries without access to a web browser.
    Please run the following command on a machine with a web browser and copy its
    output back here. Make sure the installed gcloud version is 372.0.0 or newer.
    
    gcloud auth application-default login --remote-bootstrap="https://accounts.
    google.com/o/oauth2/auth?response_type=code&
    client_id=XXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXX.apps.
    googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.
    com%2Fauth%2Fdevstorage.read_write+https%3A%2F%2Fwww.googleapis.
    com%2Fauth%2Fiam.test&state=XXXXXXXXXXXXXXXXXXXXX&
    access_type=offline&
    code_challenge=XXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXX&
    code_challenge_method=S256&token_usage=remote"
    
    Enter the output of the above command:
    
  6. After you have run the above command on a different machine with a web browser (e.g. your laptop or PC), you will be asked to authenticate with your Google account via the web flow. Once you have completed this, return to the terminal to see an output such as

    Copy the following line back to the gcloud CLI waiting to continue the login
    flow. WARNING: The following line enables access to your Google Cloud
    resources. Only copy it to the trusted machine that you ran the `gcloud auth
    application-default login --no-browser` command on earlier.
    
    https://localhost:8085/?state=XXXXXXXXXXXXXXXXXXXXXXXXXXX&code=4/
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX&
    scope=https://www.googleapis.com/auth/devstorage.read_write%20https://www.
    googleapis.com/auth/iam.test
    
  7. Copy the URL from the output of the above command (starting https://...) and paste it at the Enter the output of the above command: prompt that remains displayed on the remote server. This will give an output like

    Credentials saved to file: [/<remote-server-path>/.config/gcloud/
    application_default_credentials.json]
    
    These credentials will be used by any library that requests Application 
    Default Credentials (ADC).
    
  8. You should now be able to use the commands from How-to work with object storage in Python to manage files between the remote server and the storage bucket.
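
Before moving data, you can quickly check that the saved credentials work by asking gcloud to mint an access token from them; if this prints a token rather than an error, the Application Default Credentials are in place:

gcloud auth application-default print-access-token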

Note

When you are done, revoke your credentials with the command

gcloud auth application-default revoke

FAQs#

  • Why should I use GCP cloud object storage versus traditional network storage?

    Take a look at this overview from Google Cloud for a comparison between these storage options and their ideal use cases.

  • How much does storing data in cloud object storage cost?

    Each community’s use case is different, so we cannot offer a blanket estimate on storage costs. Please see What makes up cloud costs and how to control them for further guidance.

    Tip

    Every file you download from the hub to another machine incurs a data egress cost, which can be significant. Consider carefully whether you need to download large datasets from the hub, or alternatively post-process files on the hub where possible. Hub Champions are responsible for costs incurred from data egress.

  • How do I know if our hub is running on GCP or not?

    Check out our list of running hubs to see which cloud provider your hub is running on.

  • How do I determine if a scratch and/or persistent bucket is already available?

    Check whether the environment variables for each bucket are set, e.g. by running echo $SCRATCH_BUCKET and echo $PERSISTENT_BUCKET in the Terminal. See Scratch buckets and Persistent buckets.

  • If storage buckets are not set up but I want them for my community, what should I do?

    This feature is not enabled by default since there are extra cloud costs associated with providing object storage. Please speak to your Hub Champion, who can then open a 2i2c support ticket with us to request this feature for your hub.

  • Will 2i2c create additional, new storage buckets for our community?

    Please contact your Hub Champion, who can liaise with 2i2c support to discuss this option.

  • If our hub is running on AWS or Azure and we have object storage, what are our options?

    Check out our resources listed in the Cloud Object Storage user topic guide.

Acknowledgements#

Thank you to the LEAP-Pangeo community for authoring the original content that inspired this section.