Finally, a Way to Snapshot Multiple Volumes Without Losing Your Mind

Admin User

Author

Jun 14, 2026

I remember the exact moment this problem bit me hard. We were running a multi-tier application on Kubernetes with data spread across three volumes—one for the database, one for logs, and one for cache metadata. A customer asked for a point-in-time recovery, and we had to quiesce the entire application, manually snapshot each volume in sequence, pray nothing changed between snapshots, and then coordinate the restore. It took hours and felt absolutely wrong. I remember thinking: "There has to be a better way." Apparently, Kubernetes engineers have been thinking the exact same thing.

With Kubernetes v1.36, volume group snapshots have finally reached General Availability. This isn't just a feature bump—it's solving a real production problem that I've felt firsthand. For years, we've had snapshot APIs for individual volumes, but the moment you need to snapshot multiple volumes consistently, the system falls apart. Today, that changes.

What Volume Group Snapshots Actually Solve

The core issue is crash consistency across multiple volumes. When you have a distributed application with data split across different persistent volumes, a single point-in-time snapshot of all of them together is worth its weight in gold. Without this, you're stuck either quiescing your application (expensive and slow) or taking individual snapshots at different times, which means the snapshots no longer describe a single recovery point.

Think of it like backing up a database with transaction logs. If you back up the database at time T1 and the logs at time T2, you've got an inconsistent state. You need both at the same moment. Kubernetes now gives us a native way to do this through label selectors and a unified API.

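The implementation relies on three custom resources: VolumeGroupSnapshot (the user's namespaced request for a group snapshot), VolumeGroupSnapshotContent (the cluster-scoped object that represents the group snapshot actually taken on the storage system), and VolumeGroupSnapshotClass (the cluster-scoped configuration that names the CSI driver and the deletion policy). It's a clean API design that mirrors how individual volume snapshots work, with their VolumeSnapshot, VolumeSnapshotContent, and VolumeSnapshotClass objects.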
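
Before relying on these objects, it may be worth confirming they are installed in the cluster. Here is a minimal sketch, assuming the CRD names follow the usual <plural>.<group> convention for the groupsnapshot.storage.k8s.io API group used in this post. The function name check_group_snapshot_crds is hypothetical.

```shell
# Hypothetical helper: report which of the three group snapshot CRDs exist.
# Assumption: CRD names are <plural>.groupsnapshot.storage.k8s.io.
check_group_snapshot_crds() {
  for crd in volumegroupsnapshots volumegroupsnapshotcontents volumegroupsnapshotclasses; do
    if kubectl get crd "${crd}.groupsnapshot.storage.k8s.io" >/dev/null 2>&1; then
      echo "found:   ${crd}"
    else
      echo "missing: ${crd}"
    fi
  done
}

# Usage:
# check_group_snapshot_crds
```

If any line reports "missing", the snapshot CRDs and controller probably still need to be installed before the manifests in this post will apply.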

How This Works in Practice

The mechanics are straightforward. You give your PersistentVolumeClaims a shared label, define a VolumeGroupSnapshotClass pointing to your CSI driver, and create a VolumeGroupSnapshot manifest whose label selector matches those PVCs.

# Step 1: Label your PVCs (in the same namespace as the VolumeGroupSnapshot)
kubectl label pvc pvc-data group=myapp-backup -n production
kubectl label pvc pvc-logs group=myapp-backup -n production
kubectl label pvc pvc-cache group=myapp-backup -n production

# Step 2: Define the snapshot class
apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshotClass
metadata:
  name: production-group-snapshots
driver: ebs.csi.aws.com
deletionPolicy: Delete

# Step 3: Request the group snapshot
apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshot
metadata:
  name: backup-2026-01-15
  namespace: production
spec:
  volumeGroupSnapshotClassName: production-group-snapshots
  source:
    selector:
      matchLabels:
        group: myapp-backup

When you restore, you create new PVCs from the individual VolumeSnapshot objects that belong to the group, one PVC per volume. The beauty is on the snapshot side: the snapshot controller finds all the labeled PVCs and, through the CSI driver, snapshots them together, with no manual orchestration needed.
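
The per-volume restore can be sketched as a small manifest generator. This is an illustration under stated assumptions, not a definitive procedure: the function name restore_pvc_manifest is hypothetical, the access mode is assumed to be ReadWriteOnce, and the member snapshot name, size, and storage class are placeholders you would look up yourself (for example with kubectl get volumesnapshot -n production, since the member snapshot names are generated by the controller).

```shell
# Hypothetical helper: print a PVC manifest that restores ONE member snapshot.
# Arguments: <new-pvc-name> <member-snapshot-name> <size> <storage-class>
# Assumptions: namespace "production" and ReadWriteOnce access mode.
restore_pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
  namespace: production
spec:
  storageClassName: $4
  dataSource:
    name: $2
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $3
EOF
}

# Usage (all four arguments are placeholders):
# restore_pvc_manifest pvc-data-restored <member-snapshot-name> <size> <storage-class> \
#   | kubectl apply -f -
```

You would run this once per member snapshot, which is exactly the manual step a group-level restore helper would remove.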

My Take: The Good and the Gaps

I'm genuinely excited about this reaching GA. The API design is clean, the problem it solves is real, and the label-selector approach is idiomatic Kubernetes. This is the kind of feature that makes infrastructure code feel less like a hack and more like a system.

That said, a few things still make me cautious. First, this only works with CSI drivers. If your organization is still using in-tree volume plugins or older storage systems, you're out of luck. Second, the restore process still requires you to manually create PVCs from individual snapshots in the group. It would be nice to have a helper that reconstructs all volumes from a group snapshot with one manifest.

I also wonder about observability. How do you monitor whether a group snapshot actually captured all intended volumes? What happens if one PVC fails to snapshot while others succeed? The documentation hints at this but doesn't deep-dive into failure scenarios.
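
As a starting point for that kind of check, here is a rough sketch that reads the readyToUse flag from the group snapshot's status. It assumes the VolumeGroupSnapshot exposes status.readyToUse in the same way individual VolumeSnapshots do. The function name group_snapshot_ready is hypothetical, and this does not by itself prove that every intended PVC was included.

```shell
# Hypothetical helper: succeed only if the group snapshot reports readyToUse=true.
# Arguments: <group-snapshot-name> <namespace>
# Assumption: the object exposes .status.readyToUse.
group_snapshot_ready() {
  ready="$(kubectl get volumegroupsnapshot "$1" -n "$2" -o jsonpath='{.status.readyToUse}')"
  if [ "$ready" = "true" ]; then
    echo "ready: $1"
  else
    echo "not ready: $1 (try: kubectl describe volumegroupsnapshot $1 -n $2)"
    return 1
  fi
}

# Usage:
# group_snapshot_ready backup-2026-01-15 production
```

To cover the "did it capture everything" question, you could also compare the number of VolumeSnapshots in the namespace against the output of kubectl get pvc -n production -l group=myapp-backup.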

What This Means for My Work

Honestly, this changes how I'll design multi-volume workloads going forward. Instead of working around the snapshot limitation, I can now architect with confidence that a backup captures all volumes at the same point in time, even though the restore is still done volume by volume. For stateful applications, such as databases with separate log volumes or cache systems with metadata volumes, this is transformative.

The feature requires CSI driver support though, which means I'll need to verify that our storage vendor (or cloud provider) has implemented the group controller RPCs. That's the real gate here, not the Kubernetes feature itself.

Next Steps

If you're running Kubernetes v1.36 or later and use CSI-backed storage, I'd recommend experimenting with group snapshots in a dev environment first. Check if your CSI driver supports it, and think through your multi-volume topology. Where would atomic snapshots actually help you? That's where you'll see real value.

What's your biggest pain point with snapshots today? Are you managing multiple volumes that really need to be backed up together?

Source: This post was inspired by "Kubernetes v1.36: Moving Volume Group Snapshots to GA" by Kubernetes Blog.
