Block storage: Ceph/ZFS, performance, recovery

Block storage is rarely just about “capacity.” In virtualization, Kubernetes, and data-intensive workloads, it determines whether changes remain predictable or whether every maintenance task becomes a risk. Because release cycles are getting shorter, patch windows narrower, and failure and restart scenarios increasingly have to be justified to auditors and internal control functions, it is worth taking a sober look: Which storage model suits your workloads – and your business?

Modern open-source stacks such as Ceph RBD and ZFS are not “cheap substitutes” for SAN, but architectural options with clear strengths: traceable data paths, automatable operation, clean separation of failure domains, and measurable performance. The decisive factor is the interaction between hardware, network, workload profile, and ownership – in other words, exactly what determines stability, speed, and costs in everyday use.


Business relevance

Block storage is a lever for three things that companies currently need at the same time: reliable availability, controllable changes, and predictable performance. If storage is the bottleneck, it affects not only “IT,” but also time-to-delivery, incident load, and ultimately the cost per service. In practice, this is often seen in indirect symptoms: VM migrations take too long, stateful workloads in Kubernetes become special cases, backups are done “somehow,” but no one trusts the restore.

Added to this is the operational reality: supply chain risks and ransomware patterns are shifting the focus away from pure “uptime” thinking toward recoverability and traceability. A storage design that considers snapshots, replication, immutable backups, and tested recovery paths as standard reduces the risk organizationally – not just technically.

Operating model & ownership


Who operates storage on a daily basis – and how is knowledge made team-compatible? A distributed system like Ceph needs clear roles (platform/storage/network), clear responsibilities for upgrades, capacity planning, and incident routines. A ZFS-centric design can often keep ownership leaner, but requires discipline in replication, snapshot policies, and hardware standardization.

Update & security capability


Block storage is part of the supply chain: kernel, firmware, NIC drivers, orchestrator integrations. The key is whether updates run reproducibly (staging, maintenance window, rollback plan) and whether telemetry shows early on when latencies drift or rebuilds get out of hand. The current patch pressure makes “we’ll update sometime” a hidden availability risk.
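As a sketch of what “telemetry shows early on” can mean in practice, the following compares each new latency sample against a rolling baseline. The window size and threshold factor are assumptions to tune per environment, and the samples would come from whatever exporter or perf counter source is already in place:

```python
# Sketch: flag when storage latency drifts from its recent baseline.
# Window and threshold are illustrative defaults, not recommendations.
from statistics import median

def drift_alert(samples_ms, window=20, factor=1.5):
    """Return True if the newest latency sample exceeds 'factor' times
    the median of the preceding 'window' samples."""
    if len(samples_ms) <= window:
        return False  # not enough history for a baseline yet
    baseline = median(samples_ms[-window - 1:-1])
    return samples_ms[-1] > factor * baseline

steady = [1.0, 1.1, 0.9] * 10          # ~1 ms, stable
print(drift_alert(steady))             # no drift
print(drift_alert(steady + [2.4]))     # 2.4 ms vs ~1.0 ms baseline
```

A median baseline is deliberately robust against single outliers; in production you would additionally alert on sustained drift, not single samples.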

Integration, data & lifecycle


How are volumes provisioned (hypervisor, CSI, iSCSI/NVMe-oF), how are backups integrated, and what do data lifecycles look like (retention, snapshots, clones, immutable offsite)? A typical trade-off: maximum flexibility with self-service provisioning versus strict standards that simplify operation and auditability. Deciding early on saves later discussions about “exceptions” that become entrenched in storage over the years.
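A retention standard like the one above can be made explicit in code instead of living in tribal knowledge. The following is a minimal sketch of a keep-daily-plus-weekly rule; the counts are example values, and a real policy would also cover monthly tiers and immutable offsite copies:

```python
# Sketch: which snapshot dates does a "7 daily + 4 weekly" policy keep?
# Counts are illustrative; adapt to your retention requirements.
from datetime import date, timedelta

def to_keep(snapshot_dates, keep_daily=7, keep_weekly=4):
    """Retain the newest 'keep_daily' snapshots, plus the newest snapshot
    of each older ISO week, up to 'keep_weekly' weeks."""
    dates = sorted(set(snapshot_dates), reverse=True)
    daily = dates[:keep_daily]
    weekly, seen_weeks = [], set()
    for d in dates[keep_daily:]:
        wk = d.isocalendar()[:2]          # (ISO year, ISO week)
        if wk not in seen_weeks:
            seen_weeks.add(wk)
            weekly.append(d)
        if len(weekly) == keep_weekly:
            break
    return set(daily) | set(weekly)

# 60 daily snapshots collapse to 11 retained ones under this policy:
snaps = [date(2024, 1, 1) + timedelta(days=i) for i in range(60)]
print(len(to_keep(snaps)))  # 11
```

Everything outside the returned set is a pruning candidate – which is exactly the kind of decision you want versioned and reviewable rather than handled “somehow.”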


Specific trainings and current topics can be found in the Comelio GmbH course catalog.
Whether in-house at your company, as a webinar, or as an open event – the formats are flexibly tailored to different requirements.

Typical misunderstandings

“More IOPS solve the problem”

IOPS are important, but without a latency profile, queuing behavior, network path, and appropriate caching strategy, this number is meaningless. Performance disappointment is often caused not by too few SSDs, but by inappropriate failure domains, incorrect tuning, or a network that was never really dimensioned for storage traffic – a classic problem since NVMe-oF and 25/40/100G became “normal” in data centers.
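The relationship between IOPS, queue depth, and latency is Little’s Law, and a small worked example (with illustrative numbers, not measurements) shows why the same device “loses” IOPS behind a longer network path:

```python
# Sketch: Little's Law for storage: sustained IOPS <= queue_depth / latency.
# The latencies below are illustrative, not vendor figures.

def max_iops(queue_depth: int, mean_latency_s: float) -> float:
    """Upper bound on sustained IOPS at a given queue depth."""
    return queue_depth / mean_latency_s

# A local NVMe device, ~100 µs mean latency, queue depth 32:
local = max_iops(32, 100e-6)             # 320,000 IOPS ceiling
# Same device with a network path adding ~400 µs round trip:
remote = max_iops(32, 100e-6 + 400e-6)   # 64,000 IOPS ceiling

print(f"local:  {local:,.0f} IOPS")
print(f"remote: {remote:,.0f} IOPS")
```

The disks did not change – only the latency did, and the IOPS ceiling dropped by a factor of five. This is why a spec-sheet IOPS number without a latency and queuing profile predicts very little.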

“Ceph is always the scalable answer”

Ceph is powerful when you really want to scale out (capacity, throughput, fault tolerance) and when operation and monitoring are disciplined. For small, manageable environments or very latency-critical single-server workloads, ZFS (or a deliberately simple design) may be the better overall solution – fewer moving parts, less operational overhead.

“ZFS is just a file system”

In many setups, ZFS is more of a data integrity and snapshot framework than “just” a file system. If you set up ZFS design (vdevs, record size, sync policy), replication, and backup paths cleanly, you get a very explainable storage standard – especially where skills and operating concepts are more important than horizontal scaling.
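Part of a clean ZFS design is making the vdev layout trade-off explicit before buying hardware. Here is a rough back-of-the-envelope calculation; it deliberately ignores ZFS metadata, slop space, and raidz padding overhead, so treat the numbers as estimates only:

```python
# Sketch: raw usable capacity for common vdev layouts.
# Simplified math - real pools lose extra space to metadata and padding.

def vdev_usable_tb(disks: int, disk_tb: float, layout: str) -> float:
    if layout == "mirror":   # 2-way mirrors: half the raw capacity
        return disks * disk_tb / 2
    if layout == "raidz1":   # one disk's worth of parity
        return (disks - 1) * disk_tb
    if layout == "raidz2":   # two disks' worth of parity
        return (disks - 2) * disk_tb
    raise ValueError(f"unknown layout: {layout}")

for layout in ("mirror", "raidz1", "raidz2"):
    print(layout, vdev_usable_tb(8, 4.0, layout), "TB")
```

Capacity is only half the decision: mirrors resilver faster and deliver better random-read IOPS, while raidz2 survives any two disk failures – exactly the kind of trade-off an explainable storage standard should document.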

“HA replaces DR”

High availability helps against node failures and maintenance. Disaster recovery addresses other classes of events: logical corruption, encryption damage, location problems, or fatal operating errors. Especially against the backdrop of the current reality of attacks, it is risky to interpret HA as a “safety net” if restore tests and recovery playbooks are missing.
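One way to make “tested recovery paths” auditable rather than anecdotal is to track drill results as data. A minimal sketch, with invented service names and an assumed 90-day policy:

```python
# Sketch: flag services whose last successful restore drill is older
# than policy allows. Records and the 90-day limit are example values.
from datetime import date

def overdue_drills(last_drill: dict, today: date, max_age_days: int = 90):
    """Return service names whose last restore drill exceeds max_age_days."""
    return sorted(s for s, d in last_drill.items()
                  if (today - d).days > max_age_days)

drills = {"erp-db":    date(2024, 1, 10),
          "k8s-pvc":   date(2024, 5, 2),
          "fileshare": date(2023, 11, 3)}

print(overdue_drills(drills, today=date(2024, 6, 1)))
# ['erp-db', 'fileshare']
```

A list like this, generated in CI and reviewed in change meetings, turns “backup exists” into evidence that recovery actually works.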

Frequently asked questions about block storage

In this FAQ, you will find the topics that come up most frequently in consulting and training sessions. Each answer is kept short and refers to further content if necessary. Can’t find your question? We are happy to help you personally.


When is a distributed system like Ceph the right choice – and when is ZFS?

If scaling across many nodes, fault tolerance across failure domains, and simultaneous multi-workload usage are your primary concerns, then a distributed system is the way to go. If, on the other hand, you operate a small number of systems in a highly controlled manner and latency and simplicity are key, then ZFS is often the more pragmatic choice.

What role does the network play in storage performance?

A central one. Many storage problems are actually network problems: inconsistent MTU, buffering, faulty LACP configurations, or too few separate paths. Especially with NVMe-oF or highly parallel workloads, the network path becomes the dominant factor.

How important are restore tests if high availability is already in place?

Very important, because HA covers different error classes than recovery from corruption, operator errors, or security incidents. Regular restore drills and documented runbooks are often the difference between “backup exists” and “recovery works.”

Can storage operations be automated without adding risk?

Yes – if automation is understood as standardization: versioned configuration, reproducible rollouts, clear metrics, and a defined change process. It becomes risky when automation prioritizes “speed” but ownership, monitoring, and rollback are lacking.