Virtual machines with KVM: Stable operation with Proxmox, libvirt, and virsh

Virtual machines continue to be the most reliable foundation for business applications in many companies, because separation, predictable updates, and auditable processes matter more than “even more automation.” KVM (QEMU/libvirt) is a robust, open foundation for this: easy to document, easy to standardize, and team-friendly because operations do not hinge on a single expert.

At the same time, the bar has been raised: patch cycles, supply-chain risks, and audit requirements (e.g., NIS2-oriented security programs or BSI-related documentation requirements, depending on the industry) mean that “it kind of works” is no longer enough. What matters is therefore not whether a VM starts up in seconds, but whether architecture, ownership, storage, network, and security baselines together form an operating model that remains stable through growth, team changes, and disruptions.

A cloud with a monitor, representing virtualization with KVM (QEMU) and Proxmox VE.

Operational stability & risk exposure

VMs create isolation that matters in everyday operations: workloads can be cleanly separated, resources allocated predictably, and changes introduced in a controlled manner. At the same time, virtualization itself has moved into attackers' focus – not as theory, but as operational reality: if a hypervisor vulnerability is actively exploited, it can in the worst case affect many workloads on a host simultaneously. This is precisely why patchability and maintainability (host, management stack, firmware, driver chain) are now a business issue, not just “IT hygiene.” As a tangible example: CISA recently confirmed that a VMware ESXi vulnerability (CVE-2025-22225) is being actively exploited in ransomware campaigns.

Costs & changeability

Virtualization is not automatically “slow” – above all, it is predictable. With the right operating model and automation, provisioning, capacity planning, and standard services (e.g., internal platform services, staging environments, training/test systems) can be set up, operated, and dismantled reproducibly. The current security pressure at the hypervisor level (see the actively exploited ESXi vulnerability) is a good reality check for cost-effectiveness: quickly adding “more VMs” is only an advantage if update capability, change processes, and backup/restore really work in sync – otherwise, speed becomes a hidden operational liability.

The Comeli dragon is teaching at the blackboard at ComelioCademy.

Specific trainings and current topics can be found in the Comelio GmbH course catalog.
Whether in-house at your company, as a webinar, or as an open event – the formats are flexibly tailored to different requirements.

Operating model & ownership

Comeli represents an operating model and clear ownership - making responsibility and operations measurable.

Who operates the platform in day-to-day business, who makes decisions in the event of incidents, and how are changes approved? Proxmox VE scores highly where roles, cluster mechanics, and standard workflows relieve the burden on teams. libvirt/virsh is strong when a lean, controlled stack is desired—but then standards (templates, naming, change process) must be consciously set.

Trade-off: Less platform convenience vs. less complexity and more transparency in the underlying structure.

Update & security capability

Comeli as a boxer - security capability through hardening, patching, and risk reduction.

Can the environment process updates regularly “in sync” – including firmware, host packages, hypervisor components, and management tools? Current realities such as coordinated patch cycles and supply chain risks have a direct impact here: image sources, repo policies, signatures, secrets handling, and hardening profiles are not optional.

Trade-off: Maximum up-to-dateness vs. maximum stability – this is usually solved through staging, canary hosts, and clear maintenance windows rather than by “never touching” anything.
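The canary-first rollout described above can be sketched as a small script. This is a minimal sketch, not a finished tool: the hostnames are hypothetical, and `patch_host`/`health_ok` are placeholders you would replace with your real update and health-check commands (e.g., an SSH-driven `apt-get dist-upgrade` and cluster/quorum checks).

```shell
#!/usr/bin/env bash
# Sketch: roll out host updates canary-first, then the rest of the fleet.
# Hostnames and the patch/health commands are placeholders (assumptions).
set -euo pipefail

CANARY="pve-canary-01"               # hypothetical canary host
FLEET=("pve-prod-01" "pve-prod-02")  # hypothetical production hosts

patch_host() {
  # In a real run this would be e.g.:
  #   ssh "root@$1" 'apt-get update && apt-get -y dist-upgrade'
  echo "PATCH $1"
}

health_ok() {
  # Placeholder health gate; replace with real checks (quorum, VM
  # reachability, storage status) before touching the rest of the fleet.
  echo "CHECK $1"
  return 0
}

patch_host "$CANARY"
health_ok "$CANARY" || { echo "canary failed, aborting rollout"; exit 1; }
for host in "${FLEET[@]}"; do
  patch_host "$host"
done
```

The point of the structure is the hard stop after the canary: the fleet is only touched once the health gate passes, which keeps a bad update confined to one host and one maintenance window.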

Integration, data & lifecycle

Comeli on safari - keeping integration, data, and lifecycle in view: authentication, logging, CI/CD.

Storage backend (Ceph, ZFS, LVM-Thin), network architecture (bridges, VLANs, overlay where necessary), backup strategy, and observability must all fit together. Ceph can deliver scalability and true cluster redundancy; ZFS is often very attractive in smaller clusters with heavy snapshot workflows; local backends are fast, but shift HA concerns to other levels.

Trade-off: Operational effort and monitoring depth vs. availability/mobility (e.g., live migration without shared NFS).
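For the ZFS case, a snapshot-plus-send replication loop is the typical backup building block. The sketch below only prints the commands it would run (`DRY_RUN=1` is the default); the dataset and target names are assumptions, and a production version would add incremental sends (`zfs send -i`) and snapshot retention.

```shell
#!/usr/bin/env bash
# Sketch: offsite replication of a VM dataset via ZFS snapshots.
# Dataset/target names are hypothetical; DRY_RUN=1 only prints commands.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
DATASET="rpool/data/vm-100-disk-0"     # hypothetical VM disk dataset
TARGET="backup@backup-host:tank/vms"   # hypothetical receive target
SNAP="${DATASET}@daily-$(date +%F)"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY: $*"
  else
    "$@"
  fi
}

run zfs snapshot "$SNAP"
# Shown as a full send for simplicity; an incremental send would
# reference the previous snapshot with `zfs send -i`.
run sh -c "zfs send '$SNAP' | ssh '${TARGET%%:*}' zfs receive -u '${TARGET#*:}'"
```

A dry-run switch like this is worth keeping even in the real script: it makes the replication runbook reviewable in a change process without touching the pool.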

Typical misconceptions that slow down projects

“Proxmox is just a GUI – it doesn’t matter what’s underneath.”

The opposite is true: The web interface makes it easier to use, but the crucial questions remain the same underneath: How are the network and storage decoupled? How are updates orchestrated? What does the role model look like? Especially when security requirements (e.g., CIS benchmark-oriented hardening or industry-specific controls) come into play, a clear foundation is needed—regardless of the interface.
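To make the “what’s underneath” point concrete: on a Proxmox VE host, the decoupling of VM traffic from the host network lives in /etc/network/interfaces, not in the GUI. A minimal VLAN-aware bridge might look like the following fragment (addresses and NIC names are examples, not recommendations):

```
auto vmbr0
iface vmbr0 inet static
        address 192.0.2.10/24
        gateway 192.0.2.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
```

VM NICs then attach to vmbr0 with a per-guest VLAN tag, which keeps workload segmentation independent of the physical uplink – exactly the kind of foundation decision the web interface sits on top of.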

“libvirt/virsh is always better because it is scriptable.”

Scriptability is an advantage, but not automatically an operating model. Without defined conventions (naming, templates, image sources, secrets handling, runbooks), “maximum flexibility” quickly becomes “maximum individuality.” In heterogeneous teams, the question is rather: Where does a platform (cluster, HA, backup, roles) help, and where is the lean libvirt layer the right choice?
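What “defined conventions” can look like in a libvirt/virsh setup: even a small gate in front of the scripts prevents naming drift. The pattern `<env>-<role>-<nn>` below is an assumption for illustration, not a libvirt requirement, and `define_vm` only echoes where the real `virsh define` call would go.

```shell
#!/usr/bin/env bash
# Sketch: enforce a VM naming convention before scripts touch libvirt.
# The <env>-<role>-<nn> scheme is a hypothetical example convention.
set -euo pipefail

valid_vm_name() {
  # e.g. prod-web-01, test-db-12
  [[ "$1" =~ ^(prod|test|dev)-[a-z]+-[0-9]{2}$ ]]
}

define_vm() {
  local name="$1"
  if ! valid_vm_name "$name"; then
    echo "refusing '$name': does not match <env>-<role>-<nn>" >&2
    return 1
  fi
  echo "would run: virsh define for $name"   # placeholder for the real call
}
```

Wrappers like this are where scriptability turns into an operating model: the convention is executable, so it survives team changes instead of living in one person's head.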

“Ceph solves HA – then everything is redundant.”

Ceph can provide true redundancy, but it is not a free pass. Network design, failure domains, monitoring depth, and operating routines determine whether redundancy is real or only exists on paper. Given current ransomware patterns (lateral movement, credential theft, backup attacks), it is also important to consider recovery processes and immutable/air gap concepts – not just availability.

“Live migration = maintenance without risk.”

Live migration is powerful, but it is no substitute for clean compatibility rules (CPU flags, storage health, quorum/fencing, maintenance mode) and tested runbooks. In practice, failures often do not occur with the feature itself, but in borderline cases: partial degradation, quorum problems, inconsistent storage paths, or unclear ownership when making incident decisions.
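The compatibility rules mentioned above can be made explicit as a preflight gate. In this sketch the inputs are passed as arguments so the logic stands alone; in a real runbook the CPU models would come from `virsh capabilities`/`virsh domcapabilities` on both hosts, and the storage state from your monitoring.

```shell
#!/usr/bin/env bash
# Sketch: a minimal pre-migration gate. Real values would be gathered
# from virsh and monitoring; here they are arguments for illustration.
set -euo pipefail

migration_preflight() {
  local src_cpu="$1" dst_cpu="$2" storage_state="$3"
  if [ "$src_cpu" != "$dst_cpu" ]; then
    echo "BLOCK: CPU model mismatch ($src_cpu vs $dst_cpu)"
    return 1
  fi
  if [ "$storage_state" != "healthy" ]; then
    echo "BLOCK: shared storage not healthy ($storage_state)"
    return 1
  fi
  echo "OK: preflight passed"
}

# After a passing preflight, the migration itself would be e.g.:
#   virsh migrate --live --persistent --undefinesource vm01 \
#       qemu+ssh://dst-host/system
```

Encoding the gate as a script also answers the ownership question: a migration that the preflight blocks is an incident decision, not something an individual overrides ad hoc.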

Frequently asked questions about virtual machines

In this FAQ, you will find the topics that come up most frequently in consulting and training. Each answer is kept short and refers to further content if necessary. Can’t find your question? We are happy to help you personally.

Comeli dragon leans against a “FAQ” sign and answers questions about virtual machines.

How do we keep VM provisioning sustainable rather than a one-off effort?

Sustainable means: templates/images are versioned, cloud-init/preseed is standardized, configuration is idempotent (e.g., via Ansible), and drift is visible. In addition, secrets, repo policies, and signatures should be anchored in the process – keyword: supply-chain reality. This creates a “golden path” that scales in teams.
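A minimal cloud-init user-data fragment illustrates the kind of standardized baseline meant here; the hostname, user, key, and package list are placeholders, not recommendations:

```yaml
#cloud-config
# Hypothetical baseline for a standardized VM template; values are examples.
hostname: prod-web-01
package_update: true
packages:
  - qemu-guest-agent
users:
  - name: opsadmin
    groups: [sudo]
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example.com   # placeholder key
runcmd:
  - systemctl enable --now qemu-guest-agent
```

Version this fragment alongside the template image, and every VM starts from the same auditable state instead of from whoever provisioned it last.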

How do VMs and container platforms fit together?

VMs are often the stable layer for network boundaries, storage policies, and identity-related components, while container orchestration remains flexible above them. The key is clear zoning (management vs. workload), clean observability, and a lifecycle concept that makes upgrades plannable at both levels – instead of hiding dependencies.

Ceph, ZFS, or local storage – how do we choose?

The starting point is HA goals, operating costs, monitoring capability, and restore strategy – not just benchmarks. Ceph offers true distributed redundancy, but requires operational discipline and good network design. ZFS often impresses with snapshot workflows and simpler operation in smaller clusters. Local backends are fast and simple, but shift availability to other levels.