
AI INFRASTRUCTURE MONITORING

GPU monitoring for AI, HPC and hybrid infrastructures

Monitor NVIDIA GPUs with Pandora FMS and integrate utilization, memory, temperature, power consumption and status data into the same platform where you already monitor servers, networks, storage, services and logs.

Pandora FMS Console · GPU modules
GPU utilization: OK · 47 %
GPU memory: CRITICAL · 91 %
Temperature: WARNING · 82 °C
Power consumption: 318 W / 350 W
Critical errors: OK · 0
NVIDIA driver: 550.90.07
Local agent plugin · No remote GPU access


GPU MONITORING SOFTWARE

Your GPUs cannot be a blind spot in your IT infrastructure

AI, HPC, inference and training infrastructures rely on GPUs that concentrate cost, performance and operational risk. A GPU that is saturated, overheated, underused or carrying undetected errors can degrade services, slow down critical processes or cause production failures.

Pandora FMS integrates GPU metrics into your existing IT operations, alongside servers, networks, storage, services and logs. No separate platforms.

Detect sustained saturation
Control temperature and power consumption
Identify underutilization
Generate historical data for capacity planning
Integrate GPU monitoring with the rest of your infrastructure


HOW IT WORKS

How GPU monitoring works in Pandora FMS

The plugin runs as a local agent on the host with the NVIDIA GPU, uses nvidia-smi as its data source and generates modules that Pandora FMS incorporates into its operations.

01

Host with NVIDIA GPU

On-premise, hybrid or cloud

02

nvidia-smi

Local data source

03

Pandora FMS local plugin

Agent that generates XML modules

04

Dashboards, alerts and reports

Integrated with the rest of your infrastructure

Local agent plugin
Based on nvidia-smi
No dependency on cloud APIs
On-premise, hybrid and cloud
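
The flow above (host, nvidia-smi, local plugin, XML modules) can be illustrated with a minimal Python sketch. This is not the Pandora FMS plugin itself: the function names, module names and the selection of fields are hypothetical, and the `<module>` layout is assumed to follow the usual agent plugin output format. Only the `nvidia-smi --query-gpu` call and its field names are standard.

```python
import subprocess

# Standard nvidia-smi --query-gpu properties; the selection is illustrative.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total",
                "temperature.gpu", "power.draw"]


def read_gpu_rows(run=subprocess.run):
    """Run nvidia-smi locally and return one list of string values per GPU."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = run(cmd, capture_output=True, text=True, check=True)
    # Each output line is one GPU, with values separated by ", ".
    return [[value.strip() for value in line.split(",")]
            for line in result.stdout.splitlines() if line.strip()]


def module_xml(name, value, description=""):
    """Render one numeric module (assumed agent plugin XML layout)."""
    return (
        "<module>\n"
        f"<name><![CDATA[{name}]]></name>\n"
        "<type><![CDATA[generic_data]]></type>\n"
        f"<data><![CDATA[{value}]]></data>\n"
        f"<description><![CDATA[{description}]]></description>\n"
        "</module>"
    )


def rows_to_modules(rows):
    """Turn parsed nvidia-smi rows into per-GPU module XML strings."""
    modules = []
    for row in rows:
        values = dict(zip(QUERY_FIELDS, row))
        prefix = f"GPU{values['index']}"  # hypothetical naming scheme
        modules.append(module_xml(f"{prefix} utilization",
                                  values["utilization.gpu"], "Utilization (%)"))
        modules.append(module_xml(f"{prefix} memory used",
                                  values["memory.used"], "Memory used (MiB)"))
        modules.append(module_xml(f"{prefix} temperature",
                                  values["temperature.gpu"], "Temperature (°C)"))
        modules.append(module_xml(f"{prefix} power draw",
                                  values["power.draw"], "Power draw (W)"))
    return modules
```

On a host with the NVIDIA driver installed, printing `"\n".join(rows_to_modules(read_gpu_rows()))` would emit one module per metric and GPU. A production plugin would also need to handle fields that nvidia-smi reports as unavailable on some GPU models.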

USE CASES

From metric to incident: GPU monitoring in real operations

Knowing that a GPU is at 95% is not enough. Operational context is what matters.

01

Sustained saturation

A GPU running at 95% for hours, with high memory usage and errors in the inference service, is not a normal spike. It is an incident that requires intervention. Historical data makes it possible to distinguish one from the other.
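
As a rough illustration of how historical data separates a spike from sustained saturation, the sketch below checks whether utilization stays above a threshold for a continuous period. The function name, the 95% threshold and the one-hour duration are assumptions chosen for the example, not Pandora FMS defaults.

```python
def is_sustained_saturation(samples, threshold=95.0, min_duration_s=3600):
    """Return True if utilization stays at or above `threshold` for a
    continuous run lasting at least `min_duration_s` seconds.

    `samples` is a time-ordered list of (timestamp_seconds, utilization_percent).
    """
    run_start = None
    for timestamp, utilization in samples:
        if utilization >= threshold:
            if run_start is None:
                run_start = timestamp  # a new saturated run begins here
            if timestamp - run_start >= min_duration_s:
                return True
        else:
            run_start = None  # the run is broken by a lower sample
    return False
```

A single sample at 97% between lower readings does not trigger the check, while two hours of readings at 96% do.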

02

Thermal risk

Sustained high temperature combined with fan anomalies can be an early sign of physical degradation. Detecting it before failure enables preventive intervention instead of reacting to an outage.

03

Underutilization

An expensive GPU with low usage for weeks may indicate poor workload allocation. Historical data provides objective evidence to justify or postpone hardware decisions.

04

Capacity planning

Historical utilization and memory data helps identify demand growth, anticipate saturation and plan expansions based on data instead of estimates.
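
One simple way to turn history into a capacity estimate is a linear trend. The sketch below fits a least-squares line to daily average values and estimates how many days remain until a chosen threshold is crossed. The function name and the linear-growth assumption are illustrative; real demand may not grow linearly.

```python
def days_until_threshold(daily_values, threshold):
    """Estimate days from the last sample until `threshold` is reached.

    Fits a least-squares line to `daily_values` (one value per day).
    Returns 0.0 if the last value is already at or above the threshold,
    and None if there is too little data or the trend is flat or falling.
    """
    n = len(daily_values)
    if n < 2:
        return None
    if daily_values[-1] >= threshold:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(daily_values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_values))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Express the result relative to the last observed day.
    return max(0.0, crossing_day - (n - 1))
```

For example, memory usage growing by two points per day from 50% would be projected to reach a 90% threshold 16 days after the fifth sample.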

MONITORED METRICS

What you can monitor with Pandora FMS

Pandora FMS collects key NVIDIA GPU metrics to detect saturation, memory pressure, thermal risk, errors and capacity issues.

Performance
  • GPU utilization (%)
  • GPU operational status
Memory
  • Used, free and total memory (MiB)
  • Memory usage percentage
Temperature and power
  • Temperature (°C)
  • Instant power consumption and power limit (W)
  • Fan speed where applicable
Health and technical information
  • ECC errors where applicable
  • GPU model and driver version
  • Supported CUDA version

For technical teams: the plugin generates individual metrics per GPU and global host metrics through nvidia-smi. The technical documentation for the plugin will be available in Marketplace.
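
To make the distinction between per-GPU and host-level metrics concrete, here is a hedged sketch that parses nvidia-smi CSV output and derives both. The input is assumed to come from `--query-gpu=index,utilization.gpu,memory.used,memory.free,memory.total --format=csv,noheader,nounits`; the function names and the choice of host aggregates are hypothetical and may differ from what the plugin actually reports.

```python
def parse_gpu_csv(csv_text):
    """Parse 'index, utilization, used, free, total' lines into per-GPU dicts."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, util, used, free, total = [v.strip() for v in line.split(",")]
        used_mib, free_mib, total_mib = float(used), float(free), float(total)
        gpus.append({
            "index": int(index),
            "utilization_pct": float(util),
            "memory_used_mib": used_mib,
            "memory_free_mib": free_mib,
            "memory_total_mib": total_mib,
            # Derived metric: share of total memory currently in use.
            "memory_used_pct": round(100 * used_mib / total_mib, 1),
        })
    return gpus


def host_summary(gpus):
    """Aggregate per-GPU metrics into host-level figures (illustrative set)."""
    if not gpus:
        return {"gpu_count": 0}
    total = sum(g["memory_total_mib"] for g in gpus)
    used = sum(g["memory_used_mib"] for g in gpus)
    utilizations = [g["utilization_pct"] for g in gpus]
    return {
        "gpu_count": len(gpus),
        "avg_utilization_pct": round(sum(utilizations) / len(gpus), 1),
        "max_utilization_pct": max(utilizations),
        "memory_used_pct": round(100 * used / total, 1),
    }
```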

ALERTS

Alerts to detect saturation, temperature and critical errors

Pandora FMS lets you generate alerts on GPU metrics to detect memory pressure, high temperatures, ECC errors or loss of availability. Thresholds can be adjusted from the console according to the GPU model and the operational policy.

High GPU memory
High temperature
ECC errors where applicable
GPU unavailable
Loss of nvidia-smi data

Predefined thresholds serve as a reference and can be modified from the Pandora FMS console.
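
The threshold logic behind such alerts can be sketched in a few lines. The threshold values and metric keys below are invented for illustration and are not the plugin's predefined values; in Pandora FMS the actual thresholds are configured from the console. A missing value stands in for loss of nvidia-smi data.

```python
# Illustrative thresholds only; adjust per GPU model and operational policy.
THRESHOLDS = {
    "memory_pct": {"warning": 80.0, "critical": 90.0},
    "temperature_c": {"warning": 80.0, "critical": 90.0},
}


def evaluate(metric, value, thresholds=THRESHOLDS):
    """Map a metric value to a status string.

    Returns "UNKNOWN" when no value was collected (for example, when
    nvidia-smi returned no data), otherwise "CRITICAL", "WARNING" or "OK".
    """
    if value is None:
        return "UNKNOWN"
    limits = thresholds[metric]
    if value >= limits["critical"]:
        return "CRITICAL"
    if value >= limits["warning"]:
        return "WARNING"
    return "OK"
```

With these example limits, 91% memory usage evaluates to CRITICAL and 82 °C to WARNING.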

COMPATIBILITY

Compatibility and requirements

The plugin is designed for on-premise, hybrid and cloud environments with NVIDIA GPUs exposed to the operating system.

  • NVIDIA GPUs
  • Linux (amd64 / arm64) — validated
  • Windows (amd64) — in final validation
  • On-premise and hybrid environments
  • AWS, Azure and Google Cloud if the GPU is exposed to the OS
  • Requires the NVIDIA driver to be installed and nvidia-smi available on the host.
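
Before deploying, the driver and nvidia-smi requirement can be verified with a short preflight check. The sketch below is an assumption about how such a check might look, not part of the plugin: it looks for nvidia-smi in the PATH and uses `nvidia-smi -L` to list the GPUs visible to the operating system. The function name and messages are hypothetical.

```python
import shutil
import subprocess


def check_nvidia_smi(which=shutil.which, run=subprocess.run):
    """Return (ok, message) describing whether nvidia-smi is usable here."""
    path = which("nvidia-smi")
    if path is None:
        return False, "nvidia-smi not found in PATH"
    # 'nvidia-smi -L' prints one line per GPU, starting with 'GPU <index>:'.
    result = run([path, "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return False, f"nvidia-smi failed: {result.stderr.strip()}"
    gpu_lines = [line for line in result.stdout.splitlines()
                 if line.startswith("GPU ")]
    if not gpu_lines:
        return False, "nvidia-smi ran but listed no GPUs"
    return True, f"{len(gpu_lines)} GPU(s) visible"
```

Calling `check_nvidia_smi()` on the target host returns a boolean and a short message that can be logged before the agent is configured.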

Current limitations

  • Does not support AMD or Intel GPUs
  • Does not monitor AI models, prompts or MLOps metrics
  • Does not include drift detection or full AI observability
  • For clusters with many GPUs per node, it may be necessary to complement it with DCGM or other aggregation solutions

Local agent plugin. No remote GPU access or additional network configuration required.

Want to validate whether your NVIDIA GPUs are compatible with Pandora FMS?

Contact us →

WHY PANDORA FMS?

Why choose Pandora FMS for GPU monitoring?

Pandora FMS is not an isolated GPU tool. It is the platform where those metrics gain real operational value.

01

One console for infrastructure and GPUs

GPU metrics are integrated into the same console where you monitor servers, networks, storage, services and logs. No separate platforms.

02

On-premise, hybrid and cloud

Monitor GPUs in your own datacenters, hybrid environments and cloud instances with NVIDIA GPUs exposed to the OS, without depending on a specific provider.

03

No isolated dashboards

GPU metrics become part of existing operations: history, events, alerts, reports and dashboards within the same platform.

04

Alerts, history and reports

Every GPU metric can generate alerts, be stored historically and appear in reports. The same operating model used for servers and networks can also be applied to GPUs.

Frequently asked questions about GPU monitoring

Concept

What is GPU monitoring?

GPU monitoring is the continuous supervision of GPU status, utilization, memory, temperature, power consumption and errors in professional environments. It applies to AI, HPC, inference and model training infrastructures. It should not be confused with gaming, overclocking or graphics tuning tools.

Which GPUs does the Pandora FMS plugin support?

The plugin supports NVIDIA GPUs. It uses nvidia-smi as its data source and requires the NVIDIA driver to be installed on the host. It does not support AMD or Intel in the current version.

Compatibility

Does it work in on-premise and cloud environments?

Yes. The plugin runs as a local agent on the host with the GPU. Linux is validated (amd64 / arm64). Windows is in final validation. It can be used on on-premise servers, hybrid environments and cloud instances with GPUs exposed to the operating system. It requires no remote access or additional network configuration.

What metrics does it monitor?

The plugin covers GPU utilization and status, used and free memory, temperature, power consumption and power limit, ECC errors where applicable, and technical data such as GPU model, driver version and CUDA version. The technical documentation for the plugin will be available in Marketplace.

Differences

What is the difference between nvidia-smi and Pandora FMS for GPU monitoring?

nvidia-smi is a command-line utility useful for one-off queries. Pandora FMS uses nvidia-smi as a data source and integrates those metrics into a platform with history, alerts, dashboards, reports and correlation with the rest of the infrastructure.

Does the plugin monitor AI models or MLOps metrics?

No. The plugin monitors GPU infrastructure: hardware, performance, memory, temperature and power consumption. It does not monitor AI models, prompts, drift detection or MLOps metrics.

GPU monitoring with Pandora FMS

Start monitoring your NVIDIA GPUs with Pandora FMS

Integrate NVIDIA GPU monitoring into your IT operations and turn isolated metrics into alerts, historical data, dashboards and operational reports.