AI INFRASTRUCTURE MONITORING

GPU monitoring for AI, HPC and hybrid infrastructures

Monitor NVIDIA GPUs with Pandora FMS and integrate utilization, memory, temperature, power consumption and status data into the same platform where you already monitor servers, networks, storage, services and logs.

Request your trial → Contact us →

Pandora FMS Console · GPU modules

GPU utilization: OK · 47 %

GPU memory: CRITICAL · 91 %

Temperature: WARNING · 82 °C

Power consumption: 318 W / 350 W

Critical errors: OK · 0

NVIDIA driver: 550.90.07

Local agent plugin · No remote GPU access

GPU MONITORING SOFTWARE

AI, HPC, inference and training infrastructures rely on GPUs that concentrate cost, performance and operational risk. A GPU that is saturated, overheated, underused or affected by undetected errors can degrade services, slow down critical processes or cause production failures.

Pandora FMS integrates GPU metrics into your existing IT operations, alongside servers, networks, storage, services and logs. No separate platforms.

Detect sustained saturation

Control temperature and power consumption

Identify underutilization

Generate historical data for capacity planning

Integrate GPU monitoring with the rest of your infrastructure

Request a demo →

HOW IT WORKS

How GPU monitoring works in Pandora FMS

The plugin runs as a local agent on the host with the NVIDIA GPU, uses nvidia-smi as its data source and generates modules that Pandora FMS incorporates into its operations.

01

Host with NVIDIA GPU

On-premise, hybrid or cloud

02

nvidia-smi

Local data source

03

Pandora FMS local plugin

Agent that generates XML modules

04

Dashboards, alerts and reports

Integrated with the rest of your infrastructure

Local agent plugin

Based on nvidia-smi

No dependency on cloud APIs

On-premise, hybrid and cloud
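
For technical teams: the flow above (nvidia-smi as local data source, then a local plugin that generates XML modules) can be sketched as follows. This is a minimal illustration, not the actual plugin code. The queried fields, the module names and the sample line are assumptions made for the example; the `<module>` layout follows the format used by Pandora FMS agent plugins.

```python
import csv
import io

# Fields requested from nvidia-smi. Assumption: the real plugin may query a different set.
QUERY_FIELDS = [
    "index", "name", "utilization.gpu", "memory.used",
    "memory.total", "temperature.gpu", "power.draw",
]

# Command a local agent could run to obtain one CSV line per GPU.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(QUERY_FIELDS),
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output into a list of dicts keyed by field name."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, (c.strip() for c in row))) for row in rows if row]


def module_xml(name, value, module_type="generic_data"):
    """Render one module in the XML layout read by the Pandora FMS agent."""
    return (
        "<module>\n"
        f"<name><![CDATA[{name}]]></name>\n"
        f"<type>{module_type}</type>\n"
        f"<data><![CDATA[{value}]]></data>\n"
        "</module>"
    )


def gpu_modules(gpu):
    """Build the modules for one GPU. Module names here are hypothetical."""
    prefix = f"GPU {gpu['index']}"
    mem_pct = round(100 * float(gpu["memory.used"]) / float(gpu["memory.total"]), 1)
    return [
        module_xml(f"{prefix} model", gpu["name"], "generic_data_string"),
        module_xml(f"{prefix} utilization (%)", gpu["utilization.gpu"]),
        module_xml(f"{prefix} memory usage (%)", mem_pct),
        module_xml(f"{prefix} temperature (C)", gpu["temperature.gpu"]),
        module_xml(f"{prefix} power draw (W)", gpu["power.draw"]),
    ]


# Invented sample line so the sketch runs on a host without a GPU.
SAMPLE = "0, NVIDIA A100-SXM4-40GB, 47, 37273, 40960, 82, 318.00\n"
gpus = parse_gpu_csv(SAMPLE)
print("\n".join(gpu_modules(gpus[0])))
```

On a real host, the sample text would be replaced by the output of `subprocess.run(NVIDIA_SMI_CMD, capture_output=True, text=True, check=True).stdout`. nvidia-smi can report some fields as unavailable on certain GPU models, so a production plugin would need to handle non-numeric values before converting them.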

USE CASES

From metric to incident: GPU monitoring in real operations

Knowing that a GPU is at 95% utilization is not enough. What matters is the operational context around that number.

01

Sustained saturation

A GPU running at 95% for hours, with high memory usage and errors in the inference service, is not a normal spike. It is an incident that requires intervention. Historical data makes it possible to distinguish one from the other.
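
For technical teams: one simple way to separate a spike from sustained saturation is to measure the longest continuous period above a threshold in the stored history. The sketch below assumes utilization samples taken every 5 minutes; the 95% threshold and the 2-hour minimum duration are illustrative values, not product defaults.

```python
from datetime import datetime, timedelta


def longest_run_above(samples, threshold):
    """Longest continuous time span with value >= threshold.

    samples: list of (datetime, value) tuples sorted by time.
    """
    longest = timedelta(0)
    run_start = None
    for ts, value in samples:
        if value >= threshold:
            if run_start is None:
                run_start = ts  # a new run above the threshold begins
            longest = max(longest, ts - run_start)
        else:
            run_start = None  # the run is broken
    return longest


def is_sustained(samples, threshold=95, min_duration=timedelta(hours=2)):
    """True when utilization stayed at or above threshold for min_duration."""
    return longest_run_above(samples, threshold) >= min_duration


# Invented history: 4 hours of samples, one every 5 minutes.
start = datetime(2024, 1, 1, 8, 0)
step = timedelta(minutes=5)
spike = [(start + i * step, 97 if i == 10 else 45) for i in range(48)]
sustained = [(start + i * step, 96 if 6 <= i < 42 else 45) for i in range(48)]

print(is_sustained(spike))      # single sample above 95 %
print(is_sustained(sustained))  # almost 3 hours above 95 %
```

The same check can be combined with memory usage or service errors over the same window to decide whether the situation should be treated as an incident.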

02

Thermal risk

Sustained high temperature combined with fan anomalies can be an early sign of physical degradation. Detecting it before a failure enables preventive intervention instead of reacting to an outage.

03

Underutilization

An expensive GPU with low usage for weeks may indicate poor workload allocation. Historical data provides objective evidence to justify or postpone hardware decisions.

04

Capacity planning

Historical utilization and memory data help identify demand growth, anticipate saturation and plan expansions based on data instead of estimates.
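
For technical teams: a rough way to anticipate saturation from historical data is to fit a straight line to past usage and estimate when it reaches a limit. The sketch below uses an ordinary least-squares fit on invented weekly averages; the 90% limit is an assumption, and real workloads rarely grow linearly, so the result is an orientation rather than a forecast.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(limit, days, values):
    """Days after the last sample at which the fitted trend reaches limit.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = linear_fit(days, values)
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Invented history: average GPU memory usage (%) sampled once a week.
days = [0, 7, 14, 21, 28]
memory_pct = [60, 62, 64, 66, 68]

print(round(days_until(90, days, memory_pct), 1))
```

With this sample, usage grows by 2 points per week, so the trend reaches 90% roughly 77 days after the last measurement.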

MONITORED METRICS

What you can monitor with Pandora FMS

Pandora FMS collects key NVIDIA GPU metrics to detect saturation, memory pressure, thermal risk, errors and capacity issues.

Performance

GPU utilization (%)
GPU operational status

Memory

Used, free and total memory (MiB)
Memory usage percentage

Temperature and power

Temperature (°C)
Instant power consumption and power limit (W)
Fan speed where applicable

Health and technical information

ECC errors where applicable
GPU model and driver version
Supported CUDA version

For technical teams: the plugin generates individual metrics per GPU and global host metrics through nvidia-smi. The technical documentation for the plugin will be available in Marketplace.
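
As an illustration of the difference between per-GPU and global host metrics, the sketch below aggregates per-GPU readings into a host-level summary. Which global metrics the plugin actually produces is not detailed here, so the field names and the chosen aggregates (average, sum, maximum) are assumptions.

```python
def host_summary(gpus):
    """Aggregate per-GPU readings into host-level values.

    gpus: list of dicts with the hypothetical keys used below.
    """
    used = sum(g["memory_used_mib"] for g in gpus)
    total = sum(g["memory_total_mib"] for g in gpus)
    return {
        "gpu_count": len(gpus),
        "avg_utilization_pct": round(sum(g["utilization_pct"] for g in gpus) / len(gpus), 1),
        "memory_usage_pct": round(100 * used / total, 1),
        # The hottest GPU is usually more relevant than the average.
        "max_temperature_c": max(g["temperature_c"] for g in gpus),
        "total_power_w": sum(g["power_w"] for g in gpus),
    }


# Invented readings for a host with two GPUs.
readings = [
    {"utilization_pct": 47, "memory_used_mib": 37273, "memory_total_mib": 40960,
     "temperature_c": 82, "power_w": 318.0},
    {"utilization_pct": 93, "memory_used_mib": 10240, "memory_total_mib": 40960,
     "temperature_c": 65, "power_w": 210.5},
]

summary = host_summary(readings)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Per-GPU values remain necessary for alerting, because a host-level average can hide a single GPU under memory pressure.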

ALERTS

Alerts to detect saturation, temperature and critical errors

Pandora FMS lets you generate alerts on GPU metrics to detect memory pressure, high temperatures, ECC errors or loss of availability. Thresholds can be adjusted from the console according to the GPU model and the operational policy.

High GPU memory

High temperature

ECC errors where applicable

GPU unavailable

Loss of nvidia-smi data

Predefined thresholds serve as a reference and can be modified from the Pandora FMS console.
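
For technical teams: the logic behind threshold-based statuses can be sketched in a few lines. The threshold values below are hypothetical and are not the predefined thresholds shipped with the plugin; in Pandora FMS they are configured from the console rather than in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Hypothetical reference thresholds, to be tuned per GPU model and policy.
THRESHOLDS = {
    "utilization_pct": Threshold(warning=90, critical=98),
    "memory_pct": Threshold(warning=80, critical=90),
    "temperature_c": Threshold(warning=80, critical=90),
    "ecc_errors": Threshold(warning=1, critical=1),  # any error is critical
}


def status(metric, value):
    """Map a metric reading to OK, WARNING, CRITICAL or UNKNOWN."""
    if value is None:
        return "UNKNOWN"  # no data received, e.g. nvidia-smi stopped answering
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "CRITICAL"
    if value >= limits.warning:
        return "WARNING"
    return "OK"


# Invented readings for one GPU.
readings = {"utilization_pct": 47, "memory_pct": 91, "temperature_c": 82, "ecc_errors": 0}
for metric, value in readings.items():
    print(f"{metric}: {status(metric, value)} ({value})")
```

Treating missing data as its own state keeps a loss of nvidia-smi output from being mistaken for a healthy GPU.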

COMPATIBILITY

Compatibility and requirements

The plugin is designed for on-premise, hybrid and cloud environments with NVIDIA GPUs exposed to the operating system.

NVIDIA GPUs
Linux (amd64 / arm64) — validated
Windows (amd64) — in final validation
On-premise and hybrid environments
AWS, Azure and Google Cloud if the GPU is exposed to the OS
Requires the NVIDIA driver to be installed and nvidia-smi available on the host.

Current limitations

Does not support AMD or Intel GPUs
Does not monitor AI models, prompts or MLOps metrics
Does not include drift detection or full AI observability
For clusters with many GPUs per node, it may be necessary to complement it with DCGM or other aggregation solutions

Local agent plugin. No remote GPU access or additional network configuration required.

Want to validate whether your NVIDIA GPUs are compatible with Pandora FMS?

Contact us →

WHY PANDORA FMS?

Why choose Pandora FMS for GPU monitoring?

Pandora FMS is not an isolated GPU tool. It is the platform where those metrics gain real operational value.

01

One console for infrastructure and GPUs

GPU metrics are integrated into the same console where you monitor servers, networks, storage, services and logs. No separate platforms.

02

On-premise, hybrid and cloud

Monitor GPUs in your own datacenters, hybrid environments and cloud instances with NVIDIA GPUs exposed to the OS, without depending on a specific provider.

03

No isolated dashboards

GPU metrics become part of existing operations: history, events, alerts, reports and dashboards within the same platform.

04

Alerts, history and reports

Every GPU metric can generate alerts, be stored historically and appear in reports. The same operating model used for servers and networks can also be applied to GPUs.

RELATED RESOURCES

Expand your knowledge about GPU monitoring

IT Topic

GPU monitoring for AI and hybrid environments

What metrics to monitor, the difference between nvidia-smi and monitoring platforms, and how to integrate GPU monitoring into an AI infrastructure monitoring strategy.

Read the article →

Solution

Server and infrastructure monitoring

Pandora FMS lets you monitor physical, virtual and cloud servers from a single platform. GPUs are integrated into this global context.

View solution →

Solution

AI applied to IT management and smart monitoring

Anomaly detection, prediction and automation across your IT infrastructure. GPU monitoring is part of a broader AI infrastructure monitoring strategy.

View solution →

Frequently asked questions about GPU monitoring

Concept

What is GPU monitoring?

GPU monitoring is the continuous supervision of GPU status, utilization, memory, temperature, power consumption and errors in professional environments. It applies to AI, HPC, inference and model training infrastructures. It should not be confused with gaming, overclocking or graphics tuning tools.

Which GPUs does the Pandora FMS plugin support?

The plugin supports NVIDIA GPUs. It uses nvidia-smi as its data source and requires the NVIDIA driver to be installed on the host. It does not support AMD or Intel in the current version.

Compatibility

Does it work in on-premise and cloud environments?

Yes. The plugin runs as a local agent on the host with the GPU. Linux is validated (amd64 / arm64). Windows is in final validation. It can be used on on-premise servers, hybrid environments and cloud instances with GPUs exposed to the operating system. It requires no remote access or additional network configuration.

What metrics does it monitor?

The plugin covers GPU utilization and status, used and free memory, temperature, power consumption and power limit, ECC errors where applicable, and technical data such as GPU model, driver version and CUDA version. The technical documentation for the plugin will be available in Marketplace.

Differences

What is the difference between nvidia-smi and Pandora FMS for GPU monitoring?

nvidia-smi is a command-line utility useful for one-off queries. Pandora FMS uses nvidia-smi as a data source and integrates those metrics into a platform with history, alerts, dashboards, reports and correlation with the rest of the infrastructure.

Does the plugin monitor AI models or MLOps metrics?

No. The plugin monitors GPU infrastructure: hardware, performance, memory, temperature and power consumption. It does not monitor AI models or prompts, and it does not provide drift detection or MLOps metrics.
