Availability: predictable rather than random

Availability is not a feature that is “just there” as soon as two servers are set up somewhere. In reality, it results from architectural decisions that prove themselves in everyday use: clear redundancy paths, automated switchover logic, clean monitoring chains, and recoverability that works even under stress. This is precisely where the bar has shifted in recent years: shorter patch cycles, more frequent provider and supply chain disruptions, and rising expectations for verifiability (depending on the industry, e.g., NIS2-oriented security programs or BSI-related procedures) quickly make “best effort” expensive.

The practical benefit is rarely “99.99%” as a number, but rather fewer unplanned interruptions, more predictable maintenance windows, and an operational organization that does not depend on the implicit knowledge of individuals, even when teams change. Availability thus becomes a predictable quality – not a reaction to the next incident.
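
As a back-of-the-envelope illustration of what those numbers mean in practice, a few lines of Python convert an availability target into allowed downtime per year. The figures are purely arithmetic and say nothing about how outages are distributed:

```python
# Convert an availability target into the maximum unplanned downtime
# per year it implies. Illustrative arithmetic only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability: float) -> float:
    """Maximum downtime per year (in minutes) for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999):
    print(f"{target:.2%}: {allowed_downtime_minutes(target):.0f} min/year")
```

Roughly 526 minutes per year for "three nines" versus about 53 minutes for "four nines" – which is why the organizational levers below matter more than the raw number.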

The Comeli dragon carrying packages – symbolizing availability, reliable delivery, and stable IT services.

Why availability determines costs and speed today

In many environments, the greatest damage caused by a failure is not the technical defect itself, but the chain reaction: orders stop, interfaces become overwhelmed, manual workarounds are created, and in the end, an infrastructure problem becomes a process problem. Especially when security and compliance requirements are increasing in parallel (e.g., through more auditable operating processes), availability automatically becomes a management issue: Those who cannot roll out changes in a controlled manner lose speed – and increase risk.

From a business perspective, three levers work together: First, clean redundancy reduces unplanned downtime. Second, automation lowers the “MTTR in your head” (i.e., the time it takes to figure out what to do). Third, observability creates the basis for identifying problems early on, rather than only reacting when users escalate them. In practice, this pays off especially during maintenance work: When failover, rollback, and health checks are part of the operating model, changes become routine again instead of a thrill.
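
The "MTTR in your head" shrinks when rehearsed responses are written down as code rather than carried as tribal knowledge. A minimal sketch of this idea – the alert names and steps are purely illustrative, not a real incident framework:

```python
# Minimal "runbook as code" sketch: known failure signatures map to
# rehearsed steps, so recovery does not depend on whoever happens to
# be on call. Alert names and actions are illustrative placeholders.

RUNBOOK = {
    "disk_full": ["rotate logs", "expand volume"],
    "service_down": ["check health endpoint", "restart service", "fail over"],
}

def respond(alert: str) -> list[str]:
    """Return the rehearsed steps for a known alert, or escalate."""
    return RUNBOOK.get(alert, ["escalate to on-call engineer"])
```

The point is not the dictionary – it is that the decision of what to do is made once, calmly, and reviewed, instead of being improvised at 3 a.m.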

Operating model & ownership

Comeli represents an operating model and clear ownership – making responsibility and operations measurable.

Availability rarely fails because of technology, but more often because of unclear responsibilities. Who decides on failover policy, maintenance windows, patch sequences, escalation paths, and emergency access? In regulated or audit-driven environments, this question becomes even more relevant because roles, approvals, and evidence are part of operations.

A viable model clearly separates responsibilities: platform/infrastructure for basic services, application teams for SLO-related requirements, security/compliance for guidelines. It is important to translate this into routine: who is allowed to do what, how is it documented, how is it tested, how are lessons learned after an incident – without assigning blame.

Update & security capability

Comeli as a boxer - security capability through hardening, patching, and risk reduction.

Availability without update capability is a high-interest loan: the longer you postpone patches, the greater the risk and the larger the eventual change becomes. In times of a condensed patch reality (kernel, hypervisor, container stack, libraries, firmware), stability depends on whether updates are small, frequent, and reproducible.

This includes defined maintenance windows, staging/canary mechanisms, health checks, rollback strategies, and a clear “definition of done” for changes. Setting up update processes cleanly not only reduces security risks, but also the likelihood of failure due to hectic ad hoc interventions.
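
The staging/canary idea can be sketched as a small control loop: deploy to a canary, run health checks, promote on success, roll back on failure. Here `deploy`, `health_check`, and `rollback` are placeholders for whatever your CI/CD tooling provides – an assumption for illustration, not a specific product's API:

```python
# Sketch of a staged rollout with health checks and rollback.
# deploy/health_check/rollback are injected placeholders for real tooling.

def staged_rollout(deploy, health_check, rollback, stages=("canary", "fleet")):
    """Roll out stage by stage; abort and roll back on the first failed check."""
    done = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not health_check(stage):
            # Roll back everything deployed so far, newest first,
            # so the platform returns to a known state.
            for s in reversed(done):
                rollback(s)
            return False  # change rejected
    return True  # "definition of done" reached for all stages
```

A change is only "done" when the health checks pass on every stage – which is exactly what makes rollouts routine instead of a thrill.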

Integration, Data & Lifecycle

Comeli on safari - keeping integration, data, and lifecycle in view: authentication, logging, CI/CD.

Many availability concepts work on the whiteboard – until lifecycle and integrations strike: end-of-support, driver/firmware versions, dependencies on storage backends, certificate chains, API rate limits at providers, or “hidden” single points such as a central IAM/LDAP.

A robust design therefore explicitly takes data paths and lifecycles into account: Which components must be synchronous, which can be asynchronous? Where is consistency more important than throughput? Which dependencies are critical, which are convenient? And what happens if a provider endpoint, registry, or DNS service is temporarily unavailable?
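
The last question – a temporarily unavailable provider endpoint – can often be absorbed with a simple fallback path. A hedged sketch; `fetch` and the endpoint names are illustrative placeholders, not a real registry API:

```python
# Sketch: tolerate a temporarily unavailable endpoint by trying
# fallback paths in order. fetch() and the endpoint names are placeholders.

def fetch_with_fallback(fetch, endpoints):
    """Try each endpoint in order; return the first successful result."""
    last_error = None
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except ConnectionError as err:
            last_error = err  # remember the failure, try the next path
    raise RuntimeError(f"all endpoints failed: {last_error}")
```

Whether such a fallback is acceptable depends on the consistency question above: a mirror registry is fine for pulling images, but not for writing orders.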

The Comeli dragon is teaching at the blackboard at ComelioCademy.

Specific training courses and current topics can be found in the Comelio GmbH course catalog.
Whether in-house at your company, as a webinar, or as an open event – the formats are flexibly tailored to different requirements.

Typical misconceptions that slow down availability

“High availability replaces backups.”

A cluster can keep services running – but it cannot repair logically incorrect data, recover accidentally deleted objects, or undo creeping corruption. Especially with ransomware patterns that increasingly rely on lateral movement and backup sabotage, you need a recovery strategy that works independently of the running system: versioning, separate permissions, and regularly practiced restores.
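
A practiced restore is only complete once the restored data is actually verified, ideally on a separate system with separate credentials. A minimal checksum-based sketch; the object names and manifest format are illustrative assumptions:

```python
# Sketch of a restore drill check: compare restored objects against a
# checksum manifest recorded at backup time. Names/data are illustrative.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_restore(restored: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the names of objects that are missing or do not match the manifest."""
    return [name for name, digest in manifest.items()
            if checksum(restored.get(name, b"")) != digest]
```

An empty result means the drill passed; anything else is a finding to fix before the next real incident.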

“Monitoring prevents failures.”

Monitoring detects what is already going wrong or is about to go wrong. Failures are only prevented by measures such as capacity limits, clean update and rollout processes, redundancy paths, and automated responses. Those who introduce observability without operational routines often only get “better alerting” – but not a more stable platform.
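
One such preventive measure – a hard capacity limit that sheds excess load instead of letting work pile up until the platform tips over – fits in a few lines. A sketch; the actual limit value would come from load testing, not from this example:

```python
# Sketch: prevention instead of observation. A hard in-flight limit
# rejects excess work early rather than alerting after the overload.

class CapacityLimit:
    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Admit a request if capacity remains; otherwise shed load."""
        if self.in_flight >= self.max_in_flight:
            return False  # reject instead of queueing unboundedly
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight -= 1
```

Monitoring then tells you how often the limit triggers – a useful signal – but it is the limit itself that keeps the platform standing.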

“Active-active is always better than active-passive.”

Active-active sounds attractive, but it increases complexity wherever consistency matters: databases, state, sessions, queues. In practice, active-passive with clear switchover logic is often more robust, easier to test, and simpler to operate – especially when teams do not work deep in the cluster stack every day.
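
That "clear switchover logic" can be as simple as promoting the passive node only after several consecutive failed health checks, which keeps the decision easy to reason about and easy to test. A sketch with an illustrative threshold:

```python
# Sketch of active-passive switchover logic: promote the passive node
# only after a sustained failure, not on a single missed health check.
# The threshold of 3 is illustrative; real values depend on check interval.

def should_fail_over(health_history: list[bool], threshold: int = 3) -> bool:
    """True once the last `threshold` health checks have all failed."""
    return (len(health_history) >= threshold
            and not any(health_history[-threshold:]))
```

Because the whole decision is one pure function, it can be unit-tested and rehearsed – which is precisely the operational advantage over a distributed consensus you rarely touch.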

“Redundancy means everything twice.”

Duplicate hardware without clean paths is an expensive illusion. End-to-end data paths are crucial: network, storage, name services, certificates, routing, DNS, secrets, automation. Especially in hybrid setups (on-premises plus cloud/SaaS), single points of failure often arise at integration points – not in the server rack.

Frequently asked questions about availability

In this FAQ, you will find the topics that come up most frequently in consulting and training sessions. Each answer is kept short and refers to further content if necessary. Can’t find your question? We are happy to help you personally.

Comeli dragon leans against a “FAQ” sign and answers questions about availability.

Does high availability replace backups?

Usually not. HA reduces the downtime of individual components, but does not automatically protect against logical errors, corruption, or misconfigurations. For critical systems, the combination of HA, resilient backups, and practiced restore processes is crucial.

Is active-active better than active-passive?

That depends primarily on state and consistency requirements. Stateless workloads often benefit from active-active, while stateful systems are often more robust and easier to test with active-passive. It is important that failover/failback is actually rehearsed – not just documented.

Does monitoring alone improve availability?

Monitoring is observation, not prevention. Availability only comes about when observability is combined with capacity limits, update discipline, automation, and clear response routines. Without these building blocks, alarms tend to become louder rather than more helpful.

How often should restores be practiced?

As often as necessary for the team to master the process and to ensure that changes (software, infrastructure, permissions) do not compromise recoverability. In practice, it makes sense to establish a fixed rhythm that is anchored in operational routines – including documented results.