More and more organizations are running data applications on Google Kubernetes Engine (GKE): the number of Kubernetes clusters running stateful applications on GKE has been growing exponentially since 2019, doubling every year on average. With the rise of both AI/ML workloads and data on Kubernetes, customers are looking for more integrated solutions across compute and storage. Specifically, customers have been looking for ways to make AI/ML data stored in Google Cloud Storage (GCS) easily available to containers on GKE using file semantics.
Cloud Storage FUSE for machine learning
Cloud Storage is a common choice for AI/ML workloads looking to store and access training data, models, and checkpoints. With Cloud Storage FUSE, objects in Cloud Storage buckets can be accessed as files mounted as a local file system, providing a frictionless experience for applications that need file system semantics. Cloud Storage FUSE is available today in Public Preview, with official Google support. You can deploy Cloud Storage FUSE as a regular Linux package, as part of integrations with Vertex AI, and as of today, as part of a managed turnkey offering with GKE.
Diagram 1. Cloud Storage FUSE enables file-semantics access to GCS buckets
Cloud Storage FUSE CSI support on GKE
The new Cloud Storage FUSE Container Storage Interface (CSI) driver on GKE, available in Public Preview, allows Kubernetes applications to mount Cloud Storage buckets as local file systems. This provides:
- Portability: Mount and access Cloud Storage buckets with standard file system semantics, providing portability for ML workloads that eliminates application refactoring costs. Use the CSI driver to operate with familiar Kubernetes APIs.
- Massive scale: From training, to inference, to checkpointing, you can leverage the performance of Cloud Storage to run your ML workloads at scale.
- Streaming data support: Start training jobs quickly by providing compute resources with direct access to data in Cloud Storage, rather than having to manage logic to copy it down to a local filesystem instance. This means you don’t need to wait for data to download before doing meaningful work.
- Built-in support for GKE Standard and Autopilot: The CSI driver deploys Cloud Storage FUSE under the covers, so you do not need to install or manage it yourself. Enabling the Cloud Storage FUSE CSI driver on your cluster turns on automatic deployment and management of the driver. The driver works on both Standard and Autopilot clusters.
- Non-privileged access: Previously, you needed to create your own solution to run Cloud Storage FUSE on GKE, which required privileged access. The Cloud Storage FUSE CSI driver does not need privileged access, enabling a better security posture.
- Authentication out of the box: You can use Workload Identity to easily manage authentication while having granular control over how your Pods access Cloud Storage objects.
- Extensive support for accelerators: The Cloud Storage FUSE CSI driver is supported on all accelerators available on GKE, including GPUs and TPUs.
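To illustrate the streaming point above: once a bucket is mounted, a training job can read objects with ordinary file calls and process them chunk by chunk, with no separate download step. The following is a minimal sketch, not an official sample. The directory passed to `iter_dataset`, the `.bin` suffix, and the chunk size are assumptions chosen for illustration; in a Pod, the directory would be wherever the bucket is mounted.

```python
import os
import tempfile
from typing import Iterator, Tuple


def iter_file_chunks(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's contents in fixed-size chunks instead of loading it whole."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def iter_dataset(mount_dir: str, suffix: str = ".bin") -> Iterator[Tuple[str, bytes]]:
    """Walk a directory tree and stream every matching file chunk by chunk."""
    for root, _dirs, files in os.walk(mount_dir):
        for name in sorted(files):
            if name.endswith(suffix):
                path = os.path.join(root, name)
                for chunk in iter_file_chunks(path):
                    yield path, chunk


if __name__ == "__main__":
    # A temporary directory stands in for the mounted bucket (assumption).
    with tempfile.TemporaryDirectory() as mount_dir:
        with open(os.path.join(mount_dir, "shard-0.bin"), "wb") as f:
            f.write(b"\x00" * 4096)
        total = sum(len(chunk) for _path, chunk in iter_dataset(mount_dir))
        print(f"streamed {total} bytes")
```

Because the code only uses standard file APIs, the same loop runs unchanged against a local disk, which is the portability benefit described above.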
Diagram 2. GKE Pods using the Cloud Storage FUSE CSI driver to access Cloud Storage buckets
For workloads that require file-system semantics, Cloud Storage FUSE CSI support enables:
- ML training (using PyTorch and TensorFlow) on GKE, including reading data and checkpointing saved models using Cloud Storage as the source of truth
- ML inference models that infer results from files stored in Cloud Storage
- Writing and reading back checkpoints and saved models to and from Cloud Storage
- Accelerated startup time of your data and AI/ML applications by streaming data dynamically
- Python-based third-party data apps where customers don’t have control over source code
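As a rough sketch of the checkpointing use case, the example below writes each checkpoint as a whole file and later reloads the most recent one, using only file paths. It is illustrative rather than an official pattern: the function names, the `step-XXXXXXXX.json` naming scheme, and the use of JSON are assumptions made here. With PyTorch, `torch.save` and `torch.load` could be pointed at a path under the mount in the same way.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(ckpt_dir: str, step: int, state: dict) -> str:
    """Write one checkpoint as a complete file; returns the path written."""
    os.makedirs(ckpt_dir, exist_ok=True)
    # Zero-padded step numbers make lexicographic order match numeric order.
    path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    return path


def load_latest_checkpoint(ckpt_dir: str) -> Optional[dict]:
    """Return the state from the highest-numbered checkpoint, or None if absent."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(
        n for n in os.listdir(ckpt_dir)
        if n.startswith("step-") and n.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1]), encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # A temporary directory stands in for a path under the mounted bucket.
    with tempfile.TemporaryDirectory() as mount_dir:
        ckpt_dir = os.path.join(mount_dir, "checkpoints")
        save_checkpoint(ckpt_dir, 100, {"step": 100, "loss": 0.42})
        save_checkpoint(ckpt_dir, 200, {"step": 200, "loss": 0.31})
        print(load_latest_checkpoint(ckpt_dir))
```

Writing each checkpoint to a new file, rather than updating one file in place, fits an object store, where an object is replaced as a whole.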
When evaluating Cloud Storage FUSE, please be advised of the following:
- Cloud Storage FUSE and its CSI driver are not alternatives for a fully managed file system such as Filestore. For example, Cloud Storage FUSE does not provide concurrency control for multiple writes to the same file, is not fully POSIX compliant, and does not support NFS/CIFS/SMB. Use Filestore with Multi-shares for use cases where multiple writers need to write to the same file, or where you need true POSIX support. See here for more information on usage patterns with Cloud Storage FUSE.
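Because there is no concurrency control for multiple writers to the same file, one common way to stay within that limit is to give every writer its own file and merge the results when reading. The sketch below shows this idea under stated assumptions: the helper names and the `results-<writer>.jsonl` naming are hypothetical, and using the `HOSTNAME` environment variable (which Kubernetes sets to the Pod name by default) as a writer ID is one possible choice, not a requirement.

```python
import os
import tempfile
from typing import Iterable, List


def shard_path(out_dir: str, writer_id: str, name: str = "results") -> str:
    """Build a per-writer file path so no two writers share a file."""
    return os.path.join(out_dir, f"{name}-{writer_id}.jsonl")


def write_shard(out_dir: str, writer_id: str, lines: Iterable[str]) -> str:
    """Write this writer's lines to its own file; returns the path written."""
    os.makedirs(out_dir, exist_ok=True)
    path = shard_path(out_dir, writer_id)
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return path


def read_all_shards(out_dir: str, name: str = "results") -> List[str]:
    """Read every writer's file, in file-name order, into one list of lines."""
    merged: List[str] = []
    for fname in sorted(os.listdir(out_dir)):
        if fname.startswith(name + "-") and fname.endswith(".jsonl"):
            with open(os.path.join(out_dir, fname), encoding="utf-8") as f:
                merged.extend(line.rstrip("\n") for line in f)
    return merged


if __name__ == "__main__":
    # Assumption: the Pod name from HOSTNAME serves as a unique writer ID.
    writer_id = os.environ.get("HOSTNAME", "local")
    with tempfile.TemporaryDirectory() as out_dir:
        write_shard(out_dir, writer_id, ['{"ok": true}'])
        print(read_all_shards(out_dir))
```

Workloads that truly need several writers on one file are better served by Filestore, as noted above.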
- The CSI driver is only supported starting with GKE version 1.26.
- Terraform is not supported as part of the Public Preview but will be supported for GA.
- Official support is available for the PyTorch and TensorFlow ML frameworks, while other workloads receive best-effort support.
To get started on GKE, enable the Cloud Storage FUSE CSI driver on your cluster, set up authentication via Workload Identity, and specify the Cloud Storage bucket that you would like to mount in your Pod specification. Learn more here.
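For a sense of what the Pod specification involves, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON. Treat it as an assumption-laden illustration rather than a reference: the annotation key `gke-gcsfuse/volumes` and the driver name `gcsfuse.csi.storage.gke.io` reflect our understanding of the GKE documentation and should be checked against the current docs, while the Pod name, bucket name, service account, image, and mount path are hypothetical placeholders.

```python
import json


def build_pod_manifest(
    name: str,
    bucket_name: str,
    service_account: str,
    image: str,
    mount_path: str = "/data",
) -> dict:
    """Build a Pod manifest that mounts a bucket through a CSI ephemeral volume."""
    volume_name = "gcs-fuse-volume"  # arbitrary name linking volume and mount
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Assumed annotation that opts the Pod into Cloud Storage FUSE volumes.
            "annotations": {"gke-gcsfuse/volumes": "true"},
        },
        "spec": {
            # Kubernetes service account bound through Workload Identity.
            "serviceAccountName": service_account,
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    "volumeMounts": [
                        {"name": volume_name, "mountPath": mount_path}
                    ],
                }
            ],
            "volumes": [
                {
                    "name": volume_name,
                    "csi": {
                        # Assumed driver name; verify against the GKE docs.
                        "driver": "gcsfuse.csi.storage.gke.io",
                        "volumeAttributes": {"bucketName": bucket_name},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_manifest(
        name="gcs-fuse-example",        # hypothetical Pod name
        bucket_name="my-bucket",        # hypothetical bucket
        service_account="my-ksa",       # hypothetical service account
        image="python:3.11-slim",       # any image with your application
    )
    print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON manifests, the printed output could be saved to a file and applied with `kubectl apply -f`, assuming the driver is enabled on the cluster and the service account has access to the bucket.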
To use Cloud Storage FUSE on a VM, please download the Linux package. Learn more here.
By: Akshay Ram (Senior Product Manager) and Marco Abela (Product Manager)
Originally published at: Google Cloud Blog