Jan 2026

The Next Generation of Observability

How real-time analytics and agents are reshaping the future of observability

Today's applications are generating more telemetry than ever before, and it's breaking the tools built to monitor them. As infrastructure becomes increasingly cloud-native, distributed, and AI-driven, the traditional observability stack is collapsing under the sheer scale and velocity of modern workloads. What began as a way to give engineers confidence when systems failed has become one of the most painful and expensive layers in the modern infrastructure stack.

A new generation of observability software is emerging, powered by real-time data infrastructure and AI-native automation. These systems move beyond passive monitoring, using live telemetry to detect issues, understand their root causes, and take action autonomously to maintain system health. In this shift, observability is evolving from a reactive cost center to a dynamic source of intelligence and competitive advantage.

The Two Fractures in Traditional Observability

At the data layer, the shift to cloud, microservices, and now AI has dramatically changed the shape and volume of telemetry. Modern systems now consist of thousands of distributed components, each emitting its own stream of logs, metrics, and traces. The result is an explosion in both data volume and complexity, transforming the core observability workload from a search problem (finding the proverbial "needle in the haystack") into a real-time analytics challenge, in which billions of signals must be correlated to form a coherent picture of system behavior.

Legacy observability architectures, built for a different era, have hit fundamental scaling and cost limitations. Combined with rigid ecosystems and ingest-based pricing, these limitations force customers into an impossible tradeoff: observe less or spend more. Coinbase reportedly spent $65 million on Datadog in 2022, while OpenAI's annual observability bill is rumored to approach $170 million. Security teams face similar strain, paying seven- to eight-figure Splunk bills simply to retain logs.

At the same time, the operational capacity of DevOps and security operations centers has reached a breaking point. Incident response teams are paged at 2 a.m. to search through oceans of logs to pinpoint the source of failures, while security analysts are buried under anxiety-inducing dashboards and a constant stream of alerts, many of which are false positives or redundant. This isn't just a visibility problem; it's an execution one: the surface area of infrastructure has expanded beyond what humans can reasonably maintain and secure.

Taken together, this growing misalignment between what modern infrastructure teams need and what legacy tools can deliver has exposed the two core fractures of observability: traditional data architectures are collapsing under the scale of today's workloads, while DevOps and security teams are trapped in an endless cycle of reactive firefighting, manual triage, and operational fatigue.

The Next-Generation Observability Stack

1. The Real-Time Foundation

At its core, the next-generation observability stack is being rearchitected on real-time analytics infrastructure purpose-built for modern telemetry. Unlike the time-series or search-based architectures of the past, real-time analytics engines like ClickHouse combine a columnar architecture, high-throughput ingestion, and petabyte-scale storage efficiency that together enable ultra-fast analytical queries over continuous streams of structured data. This architecture enables teams to analyze vast volumes of telemetry in real time, delivering holistic visibility at a fraction of the cost and latency of traditional systems.
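To make the workload concrete, here is a minimal sketch of the kind of query such an engine is built to serve, written against the open-source clickhouse-connect Python driver. The otel_logs table, its columns, and the connection details are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: a real-time aggregation over raw telemetry using
# the clickhouse-connect driver. The otel_logs table and its columns
# are assumptions for illustration.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Columnar storage scans only the referenced columns, so counting
# errors per service over billions of rows stays interactive.
result = client.query(
    """
    SELECT ServiceName, count() AS errors
    FROM otel_logs
    WHERE SeverityText = 'ERROR'
      AND Timestamp >= now() - INTERVAL 15 MINUTE
    GROUP BY ServiceName
    ORDER BY errors DESC
    LIMIT 10
    """
)

for service, errors in result.result_rows:
    print(f"{service}: {errors} errors")
```

The same shape of query, with a different aggregate or filter, covers much of the "correlate billions of signals" workload described above, which is precisely what columnar engines execute well.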

2. AI-Native Application Layer

Yet visibility alone isn't enough. The sheer number of services, alerts, and incidents has far outpaced what humans can reasonably maintain. What's emerging at the application layer instead is a new class of AI-native platforms that reason over observability data to automate the triage, analysis, and response workflows of infrastructure teams.

Instead of relying on engineers to sift through telemetry, these systems are deploying AI agents that autonomously detect, diagnose, and remediate issues in real time; a minimal sketch of this loop follows the list below. Early blueprints of the AI-native experience layer are already taking shape:

  • AI SRE (Site Reliability Engineering) – Companies like Traversal, Resolve, and Cleric are building autonomous incident responders that traverse dependency graphs, join logs and metrics, and pinpoint causal patterns in seconds—collapsing incident response times and turning firefighting into continuous optimization.
  • AI SOC (Security Operations Centers) – Platforms such as 7AI, Legion Security, Dropzone, and Prophet Security are transforming the SOC from a reactive command center to a proactive defense layer, using agents that triage, enrich, and correlate alerts to draft reports and automate remediation workflows.
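To ground the pattern these platforms share, here is a hypothetical triage loop in Python. Every name in it (Alert, diagnose, remediate, triage) is an assumed placeholder for illustration, not the API of any of the products named above.

```python
# Hypothetical sketch of the agent loop described above: detect,
# diagnose, remediate, and escalate only what agents cannot resolve.
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    signal: str    # e.g. "error_rate_spike"
    severity: str  # "low" | "high"

def diagnose(alert: Alert) -> str:
    """Correlate logs, metrics, and traces around the alert window
    to produce a suspected root cause (stubbed for illustration)."""
    return f"suspected bad deploy on {alert.service}"

def remediate(root_cause: str) -> bool:
    """Attempt a safe, reversible action such as a rollback and
    report whether the offending signal cleared (stubbed)."""
    return True

def triage(alerts: list[Alert]) -> list[str]:
    """Handle low-risk alerts autonomously; escalate the rest."""
    escalations = []
    for alert in alerts:
        root_cause = diagnose(alert)
        if alert.severity == "low" and remediate(root_cause):
            continue  # resolved without paging a human
        escalations.append(f"{alert.service}: {root_cause}")
    return escalations

print(triage([Alert("checkout", "error_rate_spike", "low")]))
```

The interesting design question is where each platform draws the line between autonomous remediation and human escalation; the severity check above is the crudest possible version of that policy.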

In this new paradigm, DevOps and SOC teams are freed from the burden of manual investigation and instead spend their time upstream, orchestrating agents and solving higher-impact system engineering challenges.

From Visibility to Intelligence

I'm excited about this next-generation observability stack, and about the companies building both the real-time data infrastructure and the intelligent applications that elevate observability from reactive firefighting to AI-driven automation and continuous improvement.

At the data layer, ClickHouse's real-time engine has become the backbone for modern telemetry. Today, ClickHouse already underpins observability pipelines for some of the world's most data-intensive organizations: powering petabyte-scale monitoring at OpenAI, secure model-training observability for Anthropic, Tesla's internal metrics platform, and Netflix's logging system. As we enter the AI-native era, this real-time substrate is becoming indispensable, because telemetry is the raw ingredient for intelligence. AI agents will generate constant streams of iterative queries as they explore data, introducing an entirely new access pattern for databases. ClickHouse's architecture, uniquely capable of sustaining thousands of concurrent, low-latency interactions, is well-positioned to become the de facto real-time analytics engine feeding agentic decision-making.
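To illustrate what that access pattern looks like, here is a small sketch of an agent narrowing from symptom to cause through a chain of cheap queries. The run_query stub returns canned rows so the sketch runs offline; the questions and answers are assumptions for illustration, not any vendor's API.

```python
# Illustrative sketch of the iterative, agent-driven access pattern:
# a chain of small, low-latency analytical queries that progressively
# narrows from symptom to cause.

def run_query(question: str) -> list[tuple]:
    """Stand-in for a real-time analytics client; returns canned
    rows so the sketch runs without a live database."""
    canned = {
        "which service is erroring?": [("checkout",)],
        "which endpoint within it?": [("/pay",)],
        "what changed recently?": [("deploy v2.31.4",)],
    }
    return canned[question]

def localize(symptom: str) -> str:
    # Each hop is one cheap analytical query. An agent may issue
    # dozens of these per incident, a very different load profile
    # from a human running a handful of dashboard searches.
    service = run_query("which service is erroring?")[0][0]
    endpoint = run_query("which endpoint within it?")[0][0]
    change = run_query("what changed recently?")[0][0]
    return f"{symptom}: {service} {endpoint} regressed after {change}"

print(localize("error_rate_spike"))
```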

Above this foundation, I'm equally energized by the teams building the application layer — AI-native platforms that transform observability from a reactive cost center into a continuous source of reliability. As agents begin anticipating issues, optimizing performance, and keeping modern systems aligned with business intent, observability returns to being a true layer of confidence—where real-time data and intelligent automation work together to keep infrastructure continuously healthy and secure.