How NordIQ Works

From data collection to early warning alerts. The technology behind predictive infrastructure monitoring, explained for executives and engineers.

The Challenge: Reactive Monitoring Is Broken

❌ Traditional Monitoring (What Everyone Else Does)

  • Binary Thresholds: "Alert when CPU > 80%" (but is 80% bad? depends on context!)
  • Reactive: Alerts fire AFTER problems start (too late)
  • No Context: 80% CPU on database = normal, 80% on web server = problem
  • Alert Fatigue: Thousands of alerts, most false positives
  • No Prediction: Can't see problems coming

✅ NordIQ (Predictive Intelligence)

  • Contextual Intelligence: Understands different server workload patterns
  • Predictive: Alerts 30-60 minutes BEFORE problems become critical
  • Multi-Metric Correlation: Sees compound stress (CPU + memory + I/O)
  • Trend Analysis: Detects patterns like memory leaks, gradual degradation
  • Early Warning: Time to respond proactively, not reactively

The 5-Step Process

1️⃣ Data Collection: Monitor 14 Critical Metrics

Every 5 seconds, we collect data from all your servers. Not just "CPU and memory"—we track the metrics that matter for troubleshooting and prediction.

The 14 LINBORG Metrics:

✓ CPU User %
✓ CPU System %
✓ CPU I/O Wait %
✓ CPU Idle %
✓ Java/Spark CPU %
✓ Memory Used %
✓ Swap Usage %
✓ Disk Usage %
✓ Network In (MB/s)
✓ Network Out (MB/s)
✓ Backend Connections
✓ Frontend Connections
✓ Load Average
✓ Uptime (Days)

Why these metrics? They're what SREs look at when troubleshooting. I/O wait tells you about storage bottlenecks. Connection counts reveal network issues. These aren't random—they're battle-tested.
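
For engineers: below is a minimal sketch of what a single 5-second sample could look like in code. The field names are illustrative stand-ins for the 14 LINBORG metrics above, not NordIQ's actual schema.

    from dataclasses import dataclass

    # Illustrative only: one hypothetical 5-second sample for one server.
    # Field names mirror the 14 LINBORG metrics; the real schema may differ.
    @dataclass
    class MetricSample:
        hostname: str              # e.g. "ppml0042"
        timestamp: float           # Unix epoch seconds
        cpu_user_pct: float
        cpu_system_pct: float
        cpu_iowait_pct: float
        cpu_idle_pct: float
        java_spark_cpu_pct: float
        mem_used_pct: float
        swap_used_pct: float
        disk_used_pct: float
        net_in_mbps: float         # MB/s in
        net_out_mbps: float        # MB/s out
        backend_connections: int
        frontend_connections: int
        load_average: float
        uptime_days: float

    sample = MetricSample(
        hostname="ppml0042", timestamp=1700000000.0,
        cpu_user_pct=62.0, cpu_system_pct=8.5, cpu_iowait_pct=3.1, cpu_idle_pct=26.4,
        java_spark_cpu_pct=55.0, mem_used_pct=72.0, swap_used_pct=0.0, disk_used_pct=41.0,
        net_in_mbps=12.4, net_out_mbps=9.8, backend_connections=142,
        frontend_connections=87, load_average=6.2, uptime_days=37.5,
    )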

2️⃣ AI Analysis: Temporal Fusion Transformer

Our AI model analyzes patterns using Temporal Fusion Transformers (TFT)—state-of-the-art deep learning for time series forecasting. Not just "if CPU > 80% alert"—we understand trends, correlations, and context.

What TFT Does:

  • 24-hour context window: Remembers what happened over the last day
  • 8-hour prediction horizon: Forecasts 30min, 1hr, 4hr, 8hr ahead
  • Attention mechanism: Knows which metrics matter most (like a senior SRE)
  • Multi-metric correlation: Sees compound patterns (CPU + memory + I/O)
  • Transfer learning: New servers benefit from patterns learned across your fleet

Think of it like this: A junior engineer says "CPU is high." A senior SRE says "CPU is high AND memory is climbing AND I/O wait is elevated—this is a memory leak in the Java app." TFT thinks like the senior SRE.
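
To make those windows concrete, here is the arithmetic as a short sketch (illustrative, not the production training code):

    # At 5-second granularity, the 24-hour context window and the four
    # forecast horizons translate into the following step counts.
    SAMPLE_INTERVAL_S = 5                     # one sample every 5 seconds
    CONTEXT_WINDOW_H = 24                     # look back 24 hours
    HORIZONS_MIN = [30, 60, 240, 480]         # 30min, 1hr, 4hr, 8hr ahead

    context_steps = CONTEXT_WINDOW_H * 3600 // SAMPLE_INTERVAL_S
    horizon_steps = [m * 60 // SAMPLE_INTERVAL_S for m in HORIZONS_MIN]

    print(context_steps)    # 17280 past samples per server, 14 metrics each
    print(horizon_steps)    # [360, 720, 2880, 5760] steps into the future

Conceptually, the model consumes a 17,280-step history of all 14 metrics for each server and emits a prediction at each of the four horizon offsets.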

3️⃣ Risk Scoring: Contextual Intelligence

Every server gets a Risk Score (0-100) based on current state, predictions, trends, and profile. Not just "CPU > 80% = bad"—we understand context.

Risk Score Formula:

Final Risk = (Current State × 70%) + (Predictions × 30%)

The Four Context Factors:

1. Server Profile Awareness

Database at 98% memory = ✅ Healthy (page cache). ML server at 98% memory = 🔴 Critical (OOM imminent).

2. Trend Analysis

40% CPU steady for 2 hours = ✅ Fine. 40% CPU climbing from 20% in 10 minutes = 🟡 Warning (will hit 100%).

3. Multi-Metric Correlation

CPU 85% alone = Low risk (batch job). CPU 85% + Memory 90% + I/O wait 25% = 🔴 Critical (compound stress).

4. Prediction-Aware

Current 40%, Predicted 95% in 30min = 🟡 Warning (early alert). Current 85%, Predicted 60% = 👁️ Watch (resolving).
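
The blend itself is a one-liner; the four context factors above are what make the final number trustworthy. A simplified sketch, assuming both inputs are stress values normalized to a 0-100 scale:

    def risk_score(current: float, predicted: float) -> float:
        """Blend current and predicted stress (each 0-100) per the formula above."""
        return 0.7 * current + 0.3 * predicted

    # Current stress moderate (40) but predicted to reach 95 within 30 minutes:
    print(risk_score(40, 95))   # ~56.5 -> Warning band, early alert

    # Current stress high (85) but predicted to ease to 60:
    print(risk_score(85, 60))   # ~77.5 -> the naive blend alone still looks Critical;
                                # the trend analysis in factor 2 is what lets a
                                # resolving spike be downgraded to Watch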

4️⃣ Early Warning Alerts: Graduated Severity

Not binary "OK or ON FIRE." We have 4 graduated severity levels that give you time to respond appropriately.

Alert Levels (Risk Score 0-100):

🔴 Critical (70-100) Immediate action required - page on-call
🟠 Warning (40-69) Needs attention - investigate within 1 hour
🟡 Watch (20-39) Minor concerns - trending upward
🟢 Healthy (0-19) Normal operations - no action needed

Key insight: You don't go from "Healthy" to "Critical" instantly. You progress through Watch → Warning → Critical, giving you 30-60 minutes to respond proactively. Early detection = early action.
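
In code, the mapping from risk score to severity is deliberately simple; a sketch of the thresholds listed above:

    def severity(risk: float) -> str:
        """Map a 0-100 risk score to one of the four alert levels above."""
        if risk >= 70:
            return "Critical"   # immediate action - page on-call
        if risk >= 40:
            return "Warning"    # investigate within 1 hour
        if risk >= 20:
            return "Watch"      # minor concerns, trending upward
        return "Healthy"        # normal operations

    print(severity(56))   # Warning
    print(severity(12))   # Healthy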

5️⃣ Proactive Response: Fix Before Impact

With 30-60 minutes advance warning, you can:

  • Schedule planned maintenance during business hours (not 3 AM)
  • Gracefully restart services during low-traffic windows
  • Scale infrastructure before peak demand hits
  • Fix memory leaks before they cause OOM kills
  • Clear disk space before it fills completely
  • Investigate root causes while systems are still operational

Result: Zero user impact. Zero emergency pages. Zero SLA violations.

Problems prevented, not fought.

Real-World Example: Memory Leak Detection

Scenario: ML Training Server (ppml0042)

T-60 minutes: NordIQ detects pattern

  • Current memory: 72%
  • Predicted memory (30min): 94%
  • Trend: Climbing 2% every 5 minutes (memory leak!)
  • Risk Score: 58 (🟠 Warning)
  • Alert: "ppml0042 memory leak detected, OOM in ~1 hour"

T-55 minutes: Ops team investigates

  • Correlates with deployment 2 hours ago
  • Identifies problematic training job
  • Decision: Graceful restart during low-usage window (11 PM)

T-50 minutes: Action taken

  • Schedule automated restart for 11 PM
  • Notify affected teams
  • Prepare rollback plan

11:05 PM: Automated restart

  • Service restarted during low-traffic window
  • Memory drops back to 45%
  • Total downtime: 2 minutes (planned)

3:47 AM: Everyone sleeps soundly

  • No pager alerts
  • No customer impact
  • No incident report needed
  • Cost: $0 lost, 0 customers impacted

That's the power of predictive monitoring.
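
The back-of-the-envelope arithmetic behind that T-60 alert is simple linear extrapolation (the TFT forecast is richer than this, but the intuition holds):

    # ppml0042 from the example: memory at 72%, climbing ~2 percentage
    # points every 5 minutes. Linear extrapolation alone predicts exhaustion:
    current_pct = 72.0
    rise_pct = 2.0            # percentage points gained...
    per_minutes = 5.0         # ...every 5 minutes

    minutes_to_full = (100.0 - current_pct) / rise_pct * per_minutes
    print(minutes_to_full)    # 70.0 -> out of memory in roughly an hour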

Profile-Based Transfer Learning

How new servers get accurate predictions immediately—no retraining required

The Problem with Traditional ML

Most AI models treat every server as unique. Add a new server → retrain the model. Remove a server → retrain the model. Change fleet size → retrain everything. That's expensive, slow, and impractical.

Our Solution: 7 Server Profiles

We recognize that servers of the same type behave similarly. A new ML training server (ppml0099) behaves like existing ML servers (ppml0001-ppml0098). So we learn patterns at the profile level, not at the individual server level.

  • ML Compute (ppml####): High CPU/memory during training, memory-intensive workloads
  • Database (ppdb###): High memory (page cache normal), query CPU spikes expected
  • Web API (ppweb###): Low memory (stateless), latency-sensitive, network-heavy
  • Data Ingest (ppdi###): High disk I/O, streaming workloads, network-intensive
  • Conductor / Risk / Generic: Specialized workloads and fallback category
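
Assigning a profile can start from something as simple as the hostname prefixes above. The sketch below is a hypothetical lookup, not NordIQ's actual classifier:

    import re

    # Hypothetical profile lookup keyed on the hostname prefixes above.
    PROFILE_PATTERNS = [
        (re.compile(r"^ppml\d+"),  "ml_compute"),
        (re.compile(r"^ppdb\d+"),  "database"),
        (re.compile(r"^ppweb\d+"), "web_api"),
        (re.compile(r"^ppdi\d+"),  "data_ingest"),
    ]

    def server_profile(hostname: str) -> str:
        for pattern, profile in PROFILE_PATTERNS:
            if pattern.match(hostname):
                return profile
        return "generic"   # fallback category

    print(server_profile("ppml0099"))   # ml_compute -> inherits the ML fleet's learned patterns
    print(server_profile("ppdb017"))    # database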

Benefits

Instant Intelligence

New servers get strong predictions from day 1. No waiting for training data.

🔄 80% Less Retraining

Add/remove servers without retraining. Only retrain when patterns shift.

📈 13% Better Accuracy

Profile-specific knowledge beats one-size-fits-all models.

System Architecture

Microservices Design

1. Production Adapters

Pull metrics from your sources (MongoDB, Elasticsearch, Prometheus) every 5 seconds. Send to inference daemon via REST API.

2. Inference Daemon

GPU-accelerated TFT model. <100ms latency per server. REST API (port 8000) + WebSocket streaming. Real-time predictions.

3. Dashboard

Interactive web UI (port 8501). Fleet overview, heatmap, alerts, trends, cost avoidance calculator. Updates in real-time via WebSocket.

4. Adaptive Retraining

Drift detection monitors accuracy. Auto-retrains when needed (not on fixed schedule). Validates before deployment.
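
To give a feel for how the pieces talk to each other, here is what an adapter's push to the inference daemon could look like. The endpoint path and payload fields are illustrative assumptions, not NordIQ's published API:

    import requests

    INFERENCE_URL = "http://localhost:8000/predict"   # daemon's REST API on port 8000 (path assumed)

    payload = {
        "hostname": "ppweb012",
        "timestamp": 1700000000,
        "metrics": {
            "cpu_user_pct": 34.0,
            "cpu_iowait_pct": 1.2,
            "mem_used_pct": 48.5,
            # ...remaining LINBORG metrics...
        },
    }

    resp = requests.post(INFERENCE_URL, json=payload, timeout=2)
    resp.raise_for_status()
    print(resp.json())   # e.g. a risk score and severity for ppweb012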

The Technology Stack

🧠 AI Model

Temporal Fusion Transformers (TFT)
Google Research, 2019. State-of-the-art for time series. 88K parameters, <100ms inference.

⚡ Runtime

PyTorch + GPU Acceleration
NVIDIA RTX optimization. Batch processing. <2s dashboard load time.

🔧 Framework

Python + Streamlit
Production-ready. Microservices. REST + WebSocket APIs. Easy integration.

📊 Data

Parquet + Time Series
14 LINBORG metrics. 5-second granularity. Efficient storage and querying.
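
Writing 5-second samples to Parquet is straightforward with standard Python tooling; the sketch below uses pandas and illustrative column names rather than the production schema:

    import pandas as pd

    # One hour of 5-second samples for a single (hypothetical) ingest node.
    df = pd.DataFrame({
        "timestamp": pd.date_range("2025-01-01", periods=720, freq="5s"),
        "hostname": "ppdi003",
        "cpu_user_pct": 20.0,
        "mem_used_pct": 55.0,
        # ...remaining metric columns...
    })
    df.to_parquet("ppdi003_metrics.parquet", index=False)   # columnar, compressed

    # The columnar layout makes single-metric queries cheap:
    mem = pd.read_parquet("ppdi003_metrics.parquet", columns=["timestamp", "mem_used_pct"])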

Ready to See It in Action?

Get a personalized demo and see NordIQ predict failures in your infrastructure.

Request a Demo · View Pricing

Questions? Email craig@nordiqai.io
