Block storage: Ceph/ZFS, performance, recovery
Block storage is rarely just about “capacity.” In virtualization, Kubernetes, and data-intensive workloads, it determines whether changes remain predictable or whether every maintenance task becomes a risk. Precisely because release cycles are becoming shorter, patch windows are narrower, and failure and restart scenarios increasingly have to be explained to auditors and mapped to internal control requirements, it is worth taking a sober look: Which storage model suits your workloads – and your business?
Modern open-source stacks such as Ceph RBD and ZFS are not “cheap substitutes” for SAN, but architectural options with clear strengths: traceable data paths, automatable operation, clean separation of failure domains, and measurable performance. The decisive factor is the interaction between hardware, network, workload profile, and ownership – in other words, exactly what determines stability, speed, and costs in everyday use.

Business relevance
Block storage is a lever for three things that companies currently need at the same time: reliable availability, controllable changes, and predictable performance. If storage is the bottleneck, it affects not only “IT,” but also time-to-delivery, incident load, and ultimately the cost per service. In practice, this is often seen in indirect symptoms: VM migrations take too long, stateful workloads in Kubernetes become special cases, backups are done “somehow,” but no one trusts the restore.
Added to this is the operational reality: supply chain risks and ransomware patterns are shifting the focus away from pure “uptime” thinking toward recoverability and traceability. A storage design that considers snapshots, replication, immutable backups, and tested recovery paths as standard reduces the risk organizationally – not just technically.
Operating model & ownership

Who operates storage on a daily basis – and how is knowledge shared across the team? A distributed system like Ceph needs clear roles (platform/storage/network) and defined responsibilities for upgrades, capacity planning, and incident routines. A ZFS-centric design can often keep ownership leaner, but it requires discipline in replication, snapshot policies, and hardware standardization.
Update & security capability

Block storage is part of the supply chain: kernel, firmware, NIC drivers, orchestrator integrations. The key is whether updates run reproducibly (staging, maintenance window, rollback plan) and whether telemetry shows early on when latencies drift or rebuilds get out of hand. The current patch pressure makes “we’ll update sometime” a hidden availability risk.
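How “latency drift” can be caught early is easiest to show with a small sketch: compare a current latency sample against a recorded baseline and alert when the p99 exceeds a tolerance factor. The threshold and the sample values below are illustrative assumptions, not Ceph or ZFS defaults.

```python
# Illustrative drift check: flag when the current p99 latency exceeds
# the baseline p99 by more than a tolerance factor.
# Threshold and sample values are assumptions, not defaults of any tool.
from statistics import quantiles

def p99(samples_ms):
    """99th percentile of a latency sample (in milliseconds)."""
    return quantiles(samples_ms, n=100)[98]

def latency_drifted(baseline_ms, current_ms, tolerance=1.5):
    """True if the current p99 is more than `tolerance` times the baseline p99."""
    return p99(current_ms) > tolerance * p99(baseline_ms)

baseline = [0.8, 0.9, 1.0, 1.1, 1.2] * 20   # ~1 ms baseline sample
degraded = [b * 2.0 for b in baseline]      # e.g. rebuild traffic doubles latency

assert latency_drifted(baseline, degraded)      # drift is flagged
assert not latency_drifted(baseline, baseline)  # stable system stays quiet
```

In practice, the baseline would come from telemetry recorded during a known-good period, so that a rebuild or a misbehaving update shows up as a deviation rather than a surprise.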
Integration, data & lifecycle

How are volumes provisioned (hypervisor, CSI, iSCSI/NVMe-oF), how are backups integrated, and what do data lifecycles look like (retention, snapshots, clones, immutable offsite)? A typical trade-off: maximum flexibility with self-service provisioning versus strict standards that simplify operation and auditability. Deciding early on saves later discussions about “exceptions” that become entrenched in storage over the years.
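What a codified retention standard looks like can be sketched briefly: given snapshot timestamps, keep the most recent daily snapshots plus one snapshot per week for a longer window, and mark the rest for deletion. The policy values (7 daily, 4 weekly) are examples, not a recommendation.

```python
# Hypothetical retention sketch: keep N daily and M weekly snapshots,
# return the rest as deletion candidates. Policy values are examples only.
from datetime import datetime, timedelta

def snapshots_to_delete(snaps, now, keep_daily=7, keep_weekly=4):
    """Return snapshot timestamps outside the daily/weekly retention windows."""
    daily_cutoff = now - timedelta(days=keep_daily)
    weekly_cutoff = now - timedelta(weeks=keep_weekly)
    keep, seen_weeks = set(), set()
    for ts in sorted(snaps, reverse=True):       # newest first
        if ts >= daily_cutoff:
            keep.add(ts)                         # inside the daily window
        elif ts >= weekly_cutoff:
            week = ts.isocalendar()[:2]          # keep one per ISO week
            if week not in seen_weeks:
                seen_weeks.add(week)
                keep.add(ts)
    return sorted(set(snaps) - keep)

now = datetime(2024, 6, 1)
snaps = [now - timedelta(days=d) for d in range(0, 60, 2)]  # one every 2 days
stale = snapshots_to_delete(snaps, now)
```

Encoding the policy like this makes it reviewable and versionable – the opposite of retention rules that live only in an admin’s head.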

Training
Specific training courses and current topics can be found in the Comelio GmbH course catalog.
Whether in-house at your company, as a webinar, or as an open event – the formats are flexibly tailored to different requirements.
Typical misunderstandings
“More IOPS solve the problem”
IOPS are important, but without a latency profile, queuing behavior, network path, and appropriate caching strategy, this number is meaningless. Performance disappointment is often caused not by too few SSDs, but by inappropriate failure domains, incorrect tuning, or a network that was never really dimensioned for storage traffic – a classic problem since NVMe-oF and 25/40/100G became “normal” in data centers.
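The relationship between IOPS, latency, and queuing can be made concrete with Little’s Law: sustained IOPS equals outstanding I/Os divided by average latency. The numbers below are illustrative, not from a specific benchmark.

```python
# Little's Law for storage queues: IOPS = outstanding I/Os / latency,
# so queue_depth = IOPS * latency. Illustrative numbers, not a benchmark.

def required_queue_depth(target_iops: float, latency_s: float) -> float:
    """Outstanding I/Os needed to sustain target_iops at the given latency."""
    return target_iops * latency_s

# 100k IOPS at 100 µs average latency needs ~10 outstanding I/Os ...
qd_local_nvme = required_queue_depth(100_000, 100e-6)

# ... but the same 100k IOPS at 1 ms (e.g. a congested or mis-dimensioned
# network path) needs ~100 outstanding I/Os, which many applications
# never issue in parallel. The "IOPS rating" alone hides this.
qd_congested = required_queue_depth(100_000, 1e-3)
```

This is why a drive-level IOPS figure says little on its own: once the data path adds latency, the achievable throughput drops unless the workload can keep far more I/Os in flight.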
“Ceph is always the scalable answer”
Ceph is powerful when you really want to scale out (capacity, throughput, fault tolerance) and when operation and monitoring are disciplined. For small, manageable environments or very latency-critical single-server workloads, ZFS (or a deliberately simple design) may be the better overall solution – fewer moving parts, less operational overhead.
“ZFS is just a file system”
In many setups, ZFS is more of a data integrity and snapshot framework than “just” a file system. If you set up ZFS design (vdevs, record size, sync policy), replication, and backup paths cleanly, you get a very explainable storage standard – especially where skills and operating concepts are more important than horizontal scaling.
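A clean ZFS replication path typically reduces to snapshot, incremental send, and receive. The sketch below only builds the command lines; pool and dataset names are invented, and a real setup would execute them via subprocess/ssh with error handling, not as bare lists.

```python
# Sketch of a ZFS snapshot-and-replicate step. The functions only build
# command lines; dataset names are made-up examples. Real automation would
# run these via subprocess/ssh with error handling and logging.

def snapshot_cmd(dataset: str, tag: str) -> list[str]:
    """Create a named snapshot of a dataset."""
    return ["zfs", "snapshot", f"{dataset}@{tag}"]

def incremental_send_cmd(dataset: str, prev_tag: str, tag: str) -> list[str]:
    """`zfs send -i` streams only the delta between two snapshots."""
    return ["zfs", "send", "-i", f"{dataset}@{prev_tag}", f"{dataset}@{tag}"]

def recv_cmd(target_dataset: str) -> list[str]:
    """`zfs recv -u` receives without mounting on the backup side."""
    return ["zfs", "recv", "-u", target_dataset]

cmd = incremental_send_cmd("tank/vm", "daily-2024-05-31", "daily-2024-06-01")
```

Keeping the replication step this explicit is part of what makes ZFS an “explainable” storage standard: every transferred byte corresponds to a named snapshot pair.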
“HA replaces DR”
High availability helps against node failures and maintenance. Disaster recovery addresses other classes of events: logical corruption, encryption damage, location problems, or fatal operating errors. Especially against the backdrop of the current reality of attacks, it is risky to interpret HA as a “safety net” if restore tests and recovery playbooks are missing.
Initial consultation / project initiation
If there is already a specific need – such as a new storage design, modernization of existing environments, or performance/stability issues – a structured initial consultation is advisable.
This allows the workload profile, operating model, risks, and sensible options to be clearly classified without prematurely stacking technology on top of technology.
Frequently asked questions about block storage
In this FAQ, you will find the topics that come up most frequently in consulting and training sessions. Each answer is kept short and refers to further content if necessary. Can’t find your question? We are happy to help you personally.

How can I tell if ZFS is “enough” or if I need a distributed system?
If scaling across many nodes, fault tolerance across failure domains, and simultaneous multi-workload usage are your primary concerns, then a distributed system is the way to go. If, on the other hand, you operate a small number of systems in a highly controlled manner and latency and simplicity are key, then ZFS is often the more pragmatic choice.
What role does the network really play in block storage?
A central one. Many storage problems are actually network problems: inconsistent MTU, buffering, faulty LACP configurations, or too few separate paths. Especially with NVMe-oF or highly parallel workloads, the network path becomes the dominant factor.
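One of the failure modes named above, an inconsistent MTU, is cheap to check automatically. The sketch compares the MTU at each hop of a storage path against the most common value; the host and interface names are invented example input, which in practice would come from an inventory or monitoring system.

```python
# Minimal consistency check for one common failure mode: an MTU mismatch
# somewhere on the storage path. Host/interface data is invented example
# input; in practice it would come from inventory or monitoring.

def mtu_mismatches(path_mtus: dict[str, int]) -> list[str]:
    """Return hops whose MTU differs from the most common value on the path."""
    values = list(path_mtus.values())
    expected = max(set(values), key=values.count)   # majority MTU
    return sorted(h for h, m in path_mtus.items() if m != expected)

path = {
    "node-a/eth2": 9000,
    "switch-1/port-12": 9000,
    "switch-2/port-7": 1500,   # forgotten jumbo-frame configuration
    "node-b/eth2": 9000,
}
```

A single hop at 1500 bytes in an otherwise jumbo-frame path causes fragmentation or drops that then surface as “storage latency” – exactly the kind of network problem that masquerades as a storage problem.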
How important are restore tests compared to “classic” HA?
Very important, because HA covers different error classes than recovery from corruption, operator errors, or security incidents. Regular restore drills and documented runbooks are often the difference between “backup exists” and “recovery works.”
Can I automate Ceph/ZFS in a meaningful way without introducing new risks?
Yes – if automation is understood as standardization: versioned configuration, reproducible rollouts, clear metrics, and a defined change process. It becomes risky when automation prioritizes “speed” but ownership, monitoring, and rollback are lacking.
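“Automation as standardization” can be illustrated with a toy desired-state step: a change is applied only as a new version of a known configuration, and rolled back if the post-check fails. The state model and the check are deliberately simplified assumptions, not a real Ceph/ZFS configuration schema.

```python
# Toy illustration of "automation as standardization": changes produce a
# new version of a known-good state and are rolled back if the post-check
# fails. State model and check are simplified assumptions.

def apply_change(state: dict, change: dict, post_check) -> dict:
    """Apply `change` as a new version; keep the old version on failure."""
    candidate = {**state, **change, "version": state["version"] + 1}
    if post_check(candidate):
        return candidate
    return state                      # rollback: keep the known-good version

config = {"version": 3, "replication": 2}
ok = lambda c: c.get("replication", 0) >= 2   # example invariant

# Raising replication passes the check and bumps the version ...
config = apply_change(config, {"replication": 3}, ok)
# ... lowering it below the floor fails the check and is rolled back.
config = apply_change(config, {"replication": 1}, ok)
```

The point is not the mechanics but the contract: every change is versioned, checked, and reversible – which is what separates standardization from mere scripting.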
