Monitoring & Logging: Transparency for Operations and Security
As systems grow, it is not just traffic that increases; above all, uncertainty about what counts as "normal" grows with it. Without clean telemetry, failures are discovered by accident, security events remain murky, and capacity decisions become a matter of gut feeling. Monitoring and logging are therefore not nice-to-have features but the foundation for managing operations, performance, and security as one coherent system. This is all the more true today: release cycles are shortening, supply-chain dependencies are growing, and organizations need evidence more frequently, whether for internal controls or in the context of standards and programs such as ISO/IEC 2700x, BSI-related procedures, or NIS2-oriented measures. Good observability does not simply produce "more data"; it produces signals that enable decisions: What is critical, what is merely loud, and what are we still missing in order to manage risk and operating costs cleanly?

Why observability is an operational factor today
In many environments, monitoring and logging have grown organically: a few checks here, a dashboard there, and ad hoc searching when an incident strikes. The problem is less the choice of tools than the missing operational logic behind them. Teams that treat metrics, logs, and (where relevant) traces as one continuous data chain reduce downtime, make troubleshooting faster, and gain a reliable basis for capacity and cost decisions. This matters especially under real patch and change pressure: changes happen more often, which raises the likelihood that a small configuration drift becomes expensive later.
For companies, the effect is concrete: alerting becomes manageable instead of hectic, root-cause analysis becomes reproducible instead of person-dependent, and security events can be prioritized better because they are embedded in operational data. In regulated or audit-sensitive environments, traceability and retention often have to be addressed as well, not as bureaucracy but as part of a stable operating model.
Operating model & ownership

Who actually operates the platform, and with which response model? What matters are clear responsibilities (on-call, ticket, report), uniform namespaces, and a defined "definition of done" for the telemetry of new services. In practice, observability works well when it is treated as a product: with a backlog, standards, and clear interfaces to the teams.
Update & security capability

Monitoring systems are themselves critical infrastructure. They need patch routines, an access-rights concept, clean secrets handling, and traceable changes to rules and dashboards. Especially as verifiability requirements grow (e.g., ISO- or BSI-oriented controls, depending on the industry), it pays to keep rules and configurations under version control and to make deployments reproducible.
Integration, data & lifecycle

How does data from networks, storage, hosts, databases, and applications come together without manual handoffs? Exporters and agents, APIs, syslog/journald, an OTel Collector: what matters is not so much "connecting everything" as having a consistent data model and a lifecycle concept (retention, archiving, WORM/object lock where appropriate, data-protection requirements). Good integration also means connecting to ChatOps and ticket systems so that insights flow into processes, not just into charts.
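As a sketch of such an integration layer, a minimal OpenTelemetry Collector configuration might receive OTLP and syslog data and fan it out to a metrics and a log backend. This assumes the contrib distribution of the Collector; all endpoints, hostnames, and ports are placeholders, not a recommendation:

```yaml
# Sketch of an OTel Collector as integration layer (contrib distribution).
# Endpoints, hostnames, and ports are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  syslog:
    tcp:
      listen_address: 0.0.0.0:54526
    protocol: rfc5424

processors:
  batch: {}               # batch before export to reduce backend load
  resourcedetection:
    detectors: [system]   # attach host metadata consistently

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus.internal:9090/api/v1/write
  loki:
    endpoint: http://loki.internal:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, syslog]
      processors: [resourcedetection, batch]
      exporters: [loki]
```

The point of the sketch is the shape, not the components: one consistent entry point, uniform metadata attached centrally, and clearly separated pipelines per signal type.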

Training
Specific training courses and current topics can be found in the Comelio GmbH course catalog.
Whether in-house at your company, as a webinar, or as an open course, the formats are flexibly tailored to different requirements.
Typical misconceptions that make observability expensive
“More dashboards = more control”
Many dashboards quickly create the illusion of an overview. In practice, teams often have visualizations but no clear answers to the questions: What is critical? What is normal? Who responds, and when? Without defined service signals (e.g., error rate, latency, saturation), monitoring becomes wallpaper.
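Such service signals are often expressed as recording rules so that dashboards and alerts share a single definition. A minimal sketch in Prometheus rule syntax, assuming a conventional `http_requests_total` counter and `http_request_duration_seconds` histogram; the metric, job, and rule names are placeholders:

```yaml
# Sketch of service-signal recording rules; metric and rule names are placeholders.
groups:
  - name: service-signals
    rules:
      # Error rate: share of 5xx responses over the last 5 minutes
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # Latency: 95th percentile derived from the request-duration histogram
      - record: job:http_latency_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Once such signals exist as named series, every dashboard panel and alert can reference the same definition instead of re-deriving it slightly differently each time.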
“Alerts are the same as monitoring”
Alerting is only the tip of the iceberg. If alert rules are not linked to user impact and operational consequences, the result is a flood of alerts and desensitization. Modern operational practice means less paging on system details and more escalation based on impact, backed by clean runbooks that are not reinvented during the incident.
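One way to tie an alert to user impact rather than system detail is to page only on a sustained error-rate breach and to ship the runbook with the alert. A hedged sketch in Prometheus alerting-rule syntax; the threshold, metric names, severity convention, and runbook URL are placeholders:

```yaml
# Sketch: page on sustained user impact, not on individual system details.
# Threshold, names, and the runbook URL are placeholders.
groups:
  - name: impact-alerts
    rules:
      - alert: HighErrorRate
        # Fire only if more than 5% of requests fail for 10 minutes straight
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page   # routes to on-call paging instead of a ticket queue
        annotations:
          summary: "{{ $labels.job }}: error rate above 5% for 10 minutes"
          runbook_url: https://wiki.example.internal/runbooks/high-error-rate
```

The `for:` duration and the impact-based threshold are what keep this from paging on every transient blip; the annotation puts the runbook one click away during the incident.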
“Logging is only for error cases”
Logs are often understood purely as a tool for finding errors. They are just as important for security analyses, audit trails, and change traceability, especially in distributed systems and SaaS/API integrations. With growing supply-chain risks, a traceable log pipeline (from creation to archiving) is not a luxury but a necessity.
“Tool decision first, architecture later”
Prometheus, Grafana, Loki, OpenTelemetry: these are powerful building blocks. But without a data model (labels, taxonomy), clear ownership, and a retention concept, the result is lock-in or uncontrolled sprawl that is hard to standardize later. Precisely because these ecosystems move quickly (collectors, agents, exporters, cloud integrations), it pays to settle the target architecture first and choose the toolset afterwards.
Initial consultation / project initiation
If you want to set up a new monitoring/logging stack or consolidate an existing one, a structured initial consultation is worthwhile: current pain points, target vision, constraints (on-premises/cloud/hybrid), security and retention requirements, and the question of how telemetry will be integrated into operations and delivery processes.
Frequently asked questions about monitoring & logging
In this FAQ, you will find the topics that come up most frequently in consulting and training sessions. Each answer is kept short and refers to further content if necessary. Can’t find your question? We are happy to help you personally.

Prometheus vs. OpenTelemetry – which one should you use for what?
Prometheus is strong for metrics, alerting, and proven operational processes around pull models and rules. OpenTelemetry is a standard for collecting and forwarding metrics, logs, and traces, and is often useful as glue across heterogeneous environments. In practice, the combination is common: Prometheus for metrics and alerting, an OTel Collector as the integration layer where different sources need to be merged.
How do I avoid alert flooding and "blind" dashboards?
Start with a few service-oriented signals and escalate based on impact: paging only for user impact, details via tickets or reports. Add inhibition rules and silences, clear runbooks, and regular rule reviews. The crucial point is that alerting becomes part of the operating model, not just "a few rules."
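In the Prometheus ecosystem, this split between paging and tickets is typically configured in Alertmanager. A minimal sketch, assuming a `severity` label convention and a `service` label on alerts; receiver names and label values are placeholders:

```yaml
# Sketch of impact-based routing; receiver names and label values are placeholders.
route:
  receiver: tickets              # default: everything lands in the ticket queue
  routes:
    - matchers:
        - severity = "page"
      receiver: oncall           # only user-impacting alerts page on-call

inhibit_rules:
  # While a "page" alert fires for a service, mute its "ticket"-level details
  - source_matchers:
      - severity = "page"
    target_matchers:
      - severity = "ticket"
    equal: [service]

receivers:
  - name: oncall
  - name: tickets
```

The inhibition rule is what prevents one outage from producing dozens of redundant detail alerts: the page carries the impact, the suppressed details remain queryable for diagnosis.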
What makes monitoring audit-proof and compliant?
Above all, traceability along the data chain: stable identities, a clean time base, defined retention, and documented policies. Add tamper-resistant storage where requirements call for it, plus regular self-checks and reports. It is also important that changes to rules and dashboards are versioned and reviewable.
Do I also need monitoring in small environments?
Yes. Especially in small environments, failures often have an immediate impact on users and operations. The scope does not have to be large: a few core metrics, a clear log pipeline, and a manageable set of alerts are often enough to get started. Growing then means extending standards, not starting from scratch.
