From data collection to early warning alerts. The technology behind predictive infrastructure monitoring, explained for executives and engineers.
Every 5 seconds, we collect data from all your servers. Not just "CPU and memory"—we track the metrics that matter for troubleshooting and prediction.
Why these metrics? They're what SREs look at when troubleshooting. I/O wait tells you about storage bottlenecks. Connection counts reveal network issues. These aren't random—they're battle-tested.
Our AI model analyzes patterns using Temporal Fusion Transformers (TFT)—state-of-the-art deep learning for time series forecasting. Not just "if CPU > 80% alert"—we understand trends, correlations, and context.
Think of it like this: A junior engineer says "CPU is high." A senior SRE says "CPU is high AND memory is climbing AND I/O wait is elevated—this is a memory leak in the Java app." TFT thinks like the senior SRE.
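The senior-SRE reasoning above can be sketched as a toy rule. This is illustrative only: the thresholds are assumptions, and the real TFT model learns these correlations from data rather than hard-coding them.

```python
def diagnose(cpu_pct: float, mem_pct: float, io_wait_pct: float) -> str:
    """Toy multi-signal rule: the junior engineer reads one metric,
    the senior SRE correlates several. Thresholds are illustrative."""
    if cpu_pct > 80 and mem_pct > 85 and io_wait_pct > 15:
        return "likely memory leak: compound CPU + memory + I/O stress"
    if cpu_pct > 80:
        return "high CPU alone: probably a batch job, low risk"
    return "healthy"
```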
Every server gets a Risk Score (0-100) based on current state, predictions, trends, and profile. Not just "CPU > 80% = bad"—we understand context.
Final Risk = (Current State × 70%) + (Predictions × 30%)
Database at 98% memory = ✅ Healthy (page cache). ML server at 98% memory = 🔴 Critical (OOM imminent).
40% CPU steady for 2 hours = ✅ Fine. 40% CPU climbing from 20% in 10 minutes = 🟡 Warning (will hit 100%).
CPU 85% alone = Low risk (batch job). CPU 85% + Memory 90% + I/O 25% = 🔴 Critical (compound stress).
Current 40%, Predicted 95% in 30min = 🟡 Warning (early alert). Current 85%, Predicted 60% = 👁️ Watch (resolving).
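The blend behind these examples is a minimal sketch of the formula above. Note that on the raw 70/30 blend a hot-but-resolving server can still outscore a climbing one; trend direction and profile context further adjust the final severity, which is why the examples above land where they do.

```python
def risk_score(current: float, predicted: float) -> float:
    """Blend current state (70%) with the 30-minute prediction (30%).
    Inputs and output are on a 0-100 scale."""
    return 0.7 * current + 0.3 * predicted

# Climbing server: moderate now, predicted to saturate -> early warning
risk_score(40, 95)   # 56.5
# Resolving server: hot now, predicted to cool -> scores higher on the raw
# blend, but a falling trend downgrades it to Watch in practice
risk_score(85, 60)   # 77.5
```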
Not binary "OK or ON FIRE." We have 4 graduated severity levels that give you time to respond appropriately.
Key insight: You don't go from "Healthy" to "Critical" instantly. You progress through Watch → Warning → Critical, giving you 30-60 minutes to respond proactively. Early detection = early action.
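One way the four graduated levels could map onto the 0-100 risk score. The band edges here are illustrative assumptions, not NordIQ's actual thresholds:

```python
def severity(risk: float) -> str:
    """Map a 0-100 risk score onto four graduated levels.
    Cutoffs are illustrative; graduated bands are what buy the
    30-60 minutes of runway before Critical."""
    if risk >= 80:
        return "Critical"
    if risk >= 60:
        return "Warning"
    if risk >= 40:
        return "Watch"
    return "Healthy"
```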
With 30-60 minutes of advance warning, you can restart a leaking service, drain traffic from the at-risk host, or scale out capacity before users feel anything.
Result: Zero user impact. Zero emergency pages. Zero SLA violations.
Problems prevented, not fought.
T-60 minutes: NordIQ detects the pattern
T-55 minutes: Ops team investigates
T-50 minutes: Action taken: an automated restart at 11:05 PM clears the issue
3:47 AM: The crash that would have paged everyone never happens. Everyone sleeps soundly
That's the power of predictive monitoring.
How new servers get accurate predictions immediately—no retraining required
Most AI treats every server as unique. Add a new server → retrain the model. Remove a server → retrain the model. Change fleet size → retrain everything. It's expensive, slow, and impractical.
We recognize that servers of the same type behave similarly. A new ML training server (ppml0099) behaves like existing ML servers (ppml0001-ppml0098). So we learn patterns at the profile level, not individual server level.
New servers get strong predictions from day 1. No waiting for training data.
Add/remove servers without retraining. Only retrain when patterns shift.
Profile-specific knowledge beats one-size-fits-all models.
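A sketch of profile-level routing: derive the profile from the hostname prefix and share one model per profile. The hostname scheme (`ppml0099`) comes from the example above; the regex and the model registry contents are assumptions.

```python
import re

# One trained model per profile, not per server (registry is hypothetical)
PROFILE_MODELS = {"ppml": "ml_training_tft", "ppdb": "database_tft", "ppweb": "web_tft"}

def profile_for(hostname: str) -> str:
    """Strip the numeric suffix: ppml0099 -> ppml."""
    match = re.match(r"([a-z]+)\d+$", hostname)
    if not match:
        raise ValueError(f"unrecognized hostname: {hostname}")
    return match.group(1)

# A brand-new server inherits its profile's model immediately,
# with no retraining and no wait for training data
PROFILE_MODELS[profile_for("ppml0099")]
```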
Pull metrics from your sources (MongoDB, Elasticsearch, Prometheus) every 5 seconds. Send to inference daemon via REST API.
GPU-accelerated TFT model. <100ms latency per server. REST API (port 8000) + WebSocket streaming. Real-time predictions.
Interactive web UI (port 8501). Fleet overview, heatmap, alerts, trends, cost avoidance calculator. Updates in real-time via WebSocket.
Drift detection monitors accuracy. Auto-retrains when needed (not on fixed schedule). Validates before deployment.
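The retrain trigger can be sketched as a drift check: compare recent prediction error against the error observed at validation time, and retrain only when accuracy degrades. The 1.5x tolerance ratio is an assumption, not NordIQ's actual setting.

```python
def should_retrain(recent_errors: list[float], baseline_mae: float,
                   tolerance: float = 1.5) -> bool:
    """Trigger retraining when recent mean absolute error drifts beyond
    tolerance x the validation-time baseline. No fixed schedule."""
    if not recent_errors:
        return False
    recent_mae = sum(recent_errors) / len(recent_errors)
    return recent_mae > tolerance * baseline_mae
```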
Temporal Fusion Transformers (TFT)
Google Research, 2019. State-of-the-art for time series. 88K parameters, <100ms inference.
PyTorch + GPU Acceleration
NVIDIA RTX optimization. Batch processing. <2s dashboard load time.
Python + Streamlit
Production-ready. Microservices. REST + WebSocket APIs. Easy integration.
Parquet + Time Series
14 LINBORG metrics. 5-second granularity. Efficient storage and querying.
Get a personalized demo and see NordIQ predict failures in your infrastructure.
Questions? Email craig@nordiqai.io