How NordIQ Works

From data collection to early warning alerts. The technology behind predictive infrastructure monitoring, explained for executives and engineers.

The Challenge: Reactive Monitoring Is Broken

❌ Traditional Monitoring (What Everyone Else Does)

  • Binary Thresholds: "Alert when CPU > 80%" (but is 80% bad? depends on context!)
  • Reactive: Alerts fire AFTER problems start (too late)
  • No Context: 80% CPU on database = normal, 80% on web server = problem
  • Alert Fatigue: Thousands of alerts, most false positives
  • No Prediction: Can't see problems coming

✅ NordIQ (Predictive Intelligence)

  • Contextual Intelligence: Understands different server workload patterns
  • Predictive: Alerts 30-60 minutes BEFORE problems become critical
  • Multi-Metric Correlation: Sees compound stress (CPU + memory + I/O)
  • Trend Analysis: Detects patterns like memory leaks, gradual degradation
  • Early Warning: Time to respond proactively, not reactively

The 5-Step Process

1️⃣ Data Collection: Monitor 14 Critical Metrics

Every 5 seconds, we collect data from all your servers. Not just "CPU and memory"—we track the metrics that matter for troubleshooting and prediction.

The 14 LINBORG Metrics:

✓ CPU User %
✓ CPU System %
✓ CPU I/O Wait %
✓ CPU Idle %
✓ Java/Spark CPU %
✓ Memory Used %
✓ Swap Usage %
✓ Disk Usage %
✓ Network In (MB/s)
✓ Network Out (MB/s)
✓ Backend Connections
✓ Frontend Connections
✓ Load Average
✓ Uptime (Days)

Why these metrics? They're what SREs look at when troubleshooting. I/O wait tells you about storage bottlenecks. Connection counts reveal network issues. These aren't random—they're battle-tested.
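
For engineers: below is a minimal sketch of what a single 5-second sample could look like in code. The field names are illustrative stand-ins for the 14 LINBORG metrics above, not NordIQ's actual schema.

    from dataclasses import dataclass

    # Illustrative only: one hypothetical 5-second sample for one server.
    # Field names mirror the 14 LINBORG metrics; the real schema may differ.
    @dataclass
    class MetricSample:
        hostname: str              # e.g. "ppml0042"
        timestamp: float           # Unix epoch seconds
        cpu_user_pct: float
        cpu_system_pct: float
        cpu_iowait_pct: float
        cpu_idle_pct: float
        java_spark_cpu_pct: float
        mem_used_pct: float
        swap_used_pct: float
        disk_used_pct: float
        net_in_mbps: float         # MB/s in
        net_out_mbps: float        # MB/s out
        backend_connections: int
        frontend_connections: int
        load_average: float
        uptime_days: float

    sample = MetricSample(
        hostname="ppml0042", timestamp=1700000000.0,
        cpu_user_pct=62.0, cpu_system_pct=8.5, cpu_iowait_pct=3.1, cpu_idle_pct=26.4,
        java_spark_cpu_pct=55.0, mem_used_pct=72.0, swap_used_pct=0.0, disk_used_pct=41.0,
        net_in_mbps=12.4, net_out_mbps=9.8, backend_connections=142,
        frontend_connections=87, load_average=6.2, uptime_days=37.5,
    )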

2️⃣ AI Analysis: Temporal Fusion Transformer

Our AI model analyzes patterns using Temporal Fusion Transformers (TFT)—state-of-the-art deep learning for time series forecasting. Not just "if CPU > 80% alert"—we understand trends, correlations, and context.

What TFT Does:

  • 24-hour context window: Remembers what happened over the last day
  • 8-hour prediction horizon: Forecasts 30min, 1hr, 4hr, 8hr ahead
  • Attention mechanism: Knows which metrics matter most (like a senior SRE)
  • Multi-metric correlation: Sees compound patterns (CPU + memory + I/O)
  • Transfer learning: New servers benefit from patterns learned across your fleet

Think of it like this: A junior engineer says "CPU is high." A senior SRE says "CPU is high AND memory is climbing AND I/O wait is elevated—this is a memory leak in the Java app." TFT thinks like the senior SRE.
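
To make those windows concrete, here is the arithmetic as a short sketch (illustrative, not the production training code):

    # At 5-second granularity, the 24-hour context window and the four
    # forecast horizons translate into the following step counts.
    SAMPLE_INTERVAL_S = 5                     # one sample every 5 seconds
    CONTEXT_WINDOW_H = 24                     # look back 24 hours
    HORIZONS_MIN = [30, 60, 240, 480]         # 30min, 1hr, 4hr, 8hr ahead

    context_steps = CONTEXT_WINDOW_H * 3600 // SAMPLE_INTERVAL_S
    horizon_steps = [m * 60 // SAMPLE_INTERVAL_S for m in HORIZONS_MIN]

    print(context_steps)    # 17280 past samples per server, 14 metrics each
    print(horizon_steps)    # [360, 720, 2880, 5760] steps into the future

Conceptually, the model consumes a 17,280-step history of all 14 metrics for each server and emits a prediction at each of the four horizon offsets.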

3️⃣ Risk Scoring: Contextual Intelligence

Every server gets a Risk Score (0-100) based on current state, predictions, trends, and profile. Not just "CPU > 80% = bad"—we understand context.

Risk Score Formula:

Final Risk = (Current State × 70%) + (Predictions × 30%)

The Four Context Factors:

1. Server Profile Awareness

Database at 98% memory = ✅ Healthy (page cache). ML server at 98% memory = 🔴 Critical (OOM imminent).

2. Trend Analysis

40% CPU steady for 2 hours = ✅ Fine. 40% CPU climbing from 20% in 10 minutes = 🟡 Warning (will hit 100%).

3. Multi-Metric Correlation

CPU 85% alone = Low risk (batch job). CPU 85% + Memory 90% + I/O wait 25% = 🔴 Critical (compound stress).

4. Prediction-Aware

Current 40%, Predicted 95% in 30min = 🟡 Warning (early alert). Current 85%, Predicted 60% = 👁️ Watch (resolving).
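
The blend itself is a one-liner; the four context factors above are what make the final number trustworthy. A simplified sketch, assuming both inputs are stress values normalized to a 0-100 scale:

    def risk_score(current: float, predicted: float) -> float:
        """Blend current and predicted stress (each 0-100) per the formula above."""
        return 0.7 * current + 0.3 * predicted

    # Current stress moderate (40) but predicted to reach 95 within 30 minutes:
    print(risk_score(40, 95))   # ~56.5 -> Warning band, early alert

    # Current stress high (85) but predicted to ease to 60:
    print(risk_score(85, 60))   # ~77.5 -> the naive blend alone still looks Critical;
                                # the trend analysis in factor 2 is what lets a
                                # resolving spike be downgraded to Watch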

4️⃣ Early Warning Alerts: Graduated Severity

Not binary "OK or ON FIRE." We have 4 graduated severity levels that give you time to respond appropriately.

Alert Levels (Risk Score 0-100):

🔴 Critical (70-100) Immediate action required - page on-call
🟠 Warning (40-69) Needs attention - investigate within 1 hour
🟡 Watch (20-39) Minor concerns - trending upward
🟢 Healthy (0-19) Normal operations - no action needed

Key insight: You don't go from "Healthy" to "Critical" instantly. You progress through Watch → Warning → Critical, giving you 30-60 minutes to respond proactively. Early detection = early action.
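
In code, the mapping from risk score to severity is deliberately simple; a sketch of the thresholds listed above:

    def severity(risk: float) -> str:
        """Map a 0-100 risk score to one of the four alert levels above."""
        if risk >= 70:
            return "Critical"   # immediate action - page on-call
        if risk >= 40:
            return "Warning"    # investigate within 1 hour
        if risk >= 20:
            return "Watch"      # minor concerns, trending upward
        return "Healthy"        # normal operations

    print(severity(56))   # Warning
    print(severity(12))   # Healthy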

5️⃣ Proactive Response: Fix Before Impact

With 30-60 minutes advance warning, you can:

  • Schedule planned maintenance during business hours (not 3 AM)
  • Gracefully restart services during low-traffic windows
  • Scale infrastructure before peak demand hits
  • Fix memory leaks before they cause OOM kills
  • Clear disk space before it fills completely
  • Investigate root causes while systems are still operational

Result: Zero user impact. Zero emergency pages. Zero SLA violations.

Problems prevented, not fought.

Real-World Example: Memory Leak Detection

Scenario: ML Training Server (ppml0042)

T-60 minutes: NordIQ detects pattern

  • Current memory: 72%
  • Predicted memory (30min): 94%
  • Trend: Climbing 2% every 5 minutes (memory leak!)
  • Risk Score: 58 (🟠 Warning)
  • Alert: "ppml0042 memory leak detected, OOM in ~1 hour"

T-55 minutes: Ops team investigates

  • Correlates with deployment 2 hours ago
  • Identifies problematic training job
  • Decision: Graceful restart during low-usage window (11 PM)

T-50 minutes: Action taken

  • Schedule automated restart for 11 PM
  • Notify affected teams
  • Prepare rollback plan

11:05 PM: Automated restart

  • Service restarted during low-traffic window
  • Memory drops back to 45%
  • Total downtime: 2 minutes (planned)

3:47 AM: Everyone sleeps soundly

  • No pager alerts
  • No customer impact
  • No incident report needed
  • Cost: $0 lost, 0 customers impacted

That's the power of predictive monitoring.
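
The back-of-the-envelope arithmetic behind that T-60 alert is simple linear extrapolation (the TFT forecast is richer than this, but the intuition holds):

    # ppml0042 from the example: memory at 72%, climbing ~2 percentage
    # points every 5 minutes. Linear extrapolation alone predicts exhaustion:
    current_pct = 72.0
    rise_pct = 2.0            # percentage points gained...
    per_minutes = 5.0         # ...every 5 minutes

    minutes_to_full = (100.0 - current_pct) / rise_pct * per_minutes
    print(minutes_to_full)    # 70.0 -> out of memory in roughly an hour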

Profile-Based Transfer Learning

How new servers get accurate predictions immediately—no retraining required

The Problem with Traditional ML

Most AI models treat every server as unique. Add a new server → retrain the model. Remove a server → retrain the model. Change fleet size → retrain everything. That's expensive, slow, and impractical.

Our Solution: 7 Server Profiles

We recognize that servers of the same type behave similarly. A new ML training server (ppml0099) behaves like existing ML servers (ppml0001-ppml0098). So we learn patterns at the profile level, not at the individual server level.

  • ML Compute (ppml####): High CPU/memory during training, memory-intensive workloads
  • Database (ppdb###): High memory (page cache normal), query CPU spikes expected
  • Web API (ppweb###): Low memory (stateless), latency-sensitive, network-heavy
  • Data Ingest (ppdi###): High disk I/O, streaming workloads, network-intensive
  • Conductor / Risk / Generic: Specialized workloads and fallback category
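
Assigning a profile can start from something as simple as the hostname prefixes above. The sketch below is a hypothetical lookup, not NordIQ's actual classifier:

    import re

    # Hypothetical profile lookup keyed on the hostname prefixes above.
    PROFILE_PATTERNS = [
        (re.compile(r"^ppml\d+"),  "ml_compute"),
        (re.compile(r"^ppdb\d+"),  "database"),
        (re.compile(r"^ppweb\d+"), "web_api"),
        (re.compile(r"^ppdi\d+"),  "data_ingest"),
    ]

    def server_profile(hostname: str) -> str:
        for pattern, profile in PROFILE_PATTERNS:
            if pattern.match(hostname):
                return profile
        return "generic"   # fallback category

    print(server_profile("ppml0099"))   # ml_compute -> inherits the ML fleet's learned patterns
    print(server_profile("ppdb017"))    # database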

Benefits

Instant Intelligence

New servers get strong predictions from day 1. No waiting for training data.

🔄 80% Less Retraining

Add/remove servers without retraining. Only retrain when patterns shift.

📈 13% Better Accuracy

Profile-specific knowledge beats one-size-fits-all models.

System Architecture

Microservices Design

1. Production Adapters

Pull metrics from your sources (MongoDB, Elasticsearch, Prometheus) every 5 seconds. Send to inference daemon via REST API.

2. Inference Daemon

GPU-accelerated TFT model. <100ms latency per server. REST API (port 8000) + WebSocket streaming. Real-time predictions.

3. Dashboard

Interactive web UI (port 8501). Fleet overview, heatmap, alerts, trends, cost avoidance calculator. Updates in real-time via WebSocket.

4. Adaptive Retraining

Drift detection monitors accuracy. Auto-retrains when needed (not on fixed schedule). Validates before deployment.
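
To give a feel for how the pieces talk to each other, here is what an adapter's push to the inference daemon could look like. The endpoint path and payload fields are illustrative assumptions, not NordIQ's published API:

    import requests

    INFERENCE_URL = "http://localhost:8000/predict"   # daemon's REST API on port 8000 (path assumed)

    payload = {
        "hostname": "ppweb012",
        "timestamp": 1700000000,
        "metrics": {
            "cpu_user_pct": 34.0,
            "cpu_iowait_pct": 1.2,
            "mem_used_pct": 48.5,
            # ...remaining LINBORG metrics...
        },
    }

    resp = requests.post(INFERENCE_URL, json=payload, timeout=2)
    resp.raise_for_status()
    print(resp.json())   # e.g. a risk score and severity for ppweb012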

The Technology Stack

🧠 AI Model

Temporal Fusion Transformers (TFT)
Google Research, 2019. State-of-the-art for time series. 88K parameters, <100ms inference.

⚡ Runtime

PyTorch + GPU Acceleration
NVIDIA RTX optimization. Batch processing. <2s dashboard load time.

🔧 Framework

Python + Streamlit
Production-ready. Microservices. REST + WebSocket APIs. Easy integration.

📊 Data

Parquet + Time Series
14 LINBORG metrics. 5-second granularity. Efficient storage and querying.
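
Writing 5-second samples to Parquet is straightforward with standard Python tooling; the sketch below uses pandas and illustrative column names rather than the production schema:

    import pandas as pd

    # One hour of 5-second samples for a single (hypothetical) ingest node.
    df = pd.DataFrame({
        "timestamp": pd.date_range("2025-01-01", periods=720, freq="5s"),
        "hostname": "ppdi003",
        "cpu_user_pct": 20.0,
        "mem_used_pct": 55.0,
        # ...remaining metric columns...
    })
    df.to_parquet("ppdi003_metrics.parquet", index=False)   # columnar, compressed

    # The columnar layout makes single-metric queries cheap:
    mem = pd.read_parquet("ppdi003_metrics.parquet", columns=["timestamp", "mem_used_pct"])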

Ready to See It in Action?

Get a personalized demo and see NordIQ predict failures in your infrastructure.

Request a Demo · View Pricing

Questions? Email craig@nordiqai.io
