
GPU monitoring for AI and hybrid environments

In the good old days, as some would say, when you did not need a mortgage to buy one, the GPU (Graphics Processing Unit) was that component that only mattered to those who wanted to play Crysis at full resolution. Today, it is the antimatter reactor powering half the industry: AI model training, inference, machine learning, simulation, rendering, data analysis and HPC (High-Performance Computing). Without GPUs there is no AI, and without AI, according to the visionaries (who sell it), there is no future. One that, according to those same people, may end up destroyed anyway by that very artificial intelligence.
But in many cases, the diamond heart of the infrastructure remains unsupervised.
Servers, CPUs, networks, services… are monitored with a hawk’s eye, but GPUs remain the most expensive blind spot, one that, if left unattended, can turn into a hole in your pocket.
That is why the real challenge is not just measuring the GPU, but integrating it into the monitoring of on-premise, hybrid and cloud infrastructures.
Here we will see how.

What is GPU monitoring, or monitoring of Graphics Processing Units

GPU monitoring is the continuous supervision of the status, utilization, memory, temperature, power consumption, availability and errors of GPUs. Put like that, it sounds like a starship control panel… and we are not far off.
It is obvious given the introduction, but to be clear, here we are not talking about overclocking or squeezing out frames, but about monitoring critical resources used by AI, HPC or intensive processing workloads, as part of a professional infrastructure monitoring strategy.

Why GPU monitoring matters in AI infrastructures

GPUs are used to accelerate an organization’s critical operations, but since nothing in life comes for free, they also introduce new operational risks. This makes them both the engine of any AI project and its Achilles’ heel.
Why?

  • Because they are increasingly expensive and limited resources. Every idle or saturated card, just like every minute of downtime, is money going down the drain.
  • Saturation affects inference and training. A GPU wheezing at its limit slows down responses and extends jobs that already take long enough as it is.
  • Lack of memory causes errors or degradation. An out-of-memory error in the middle of training is the blue screen of death of our time, only much more expensive.
  • Temperature impacts performance and availability, because, just like with humans, heat degrades, and too much heat shuts things down.
  • Underutilization makes it harder to justify the investment and forces explanations when management looks at the expense column that takes up three sheets of continuous stationery.
  • Historical data helps make capacity decisions, but a lack of visibility complicates operations across hybrid, on-premise and cloud environments.

The conclusion is that leaving the GPU out of the monitoring family picture is madness in these times we are living through.

The real gap: GPU monitoring for on-premise, hybrid and enterprise environments

There is a widespread belief that AI lives in the cloud, in a data center managed by someone else and devastating an ecosystem far removed from our own. However, many organizations operate GPUs in their own data centers, in hybrid environments, research centers, industrial settings, cloud instances with GPU…
And many of these organizations operate under confidentiality, data sovereignty or data legislation requirements where a 100% cloud solution is not viable, under penalty of fines or the risk of dangerous leaks.
In these cases, it is not enough to install Ollama or similar tools and forget about it, nor is the current documentation from a cloud provider enough to operate with confidence and efficiency.
We need a platform capable of watching over everything, GPUs and the rest of the infrastructure as a whole, whether on our own servers or in virtual environments.
The never-ending debate between on-premise and SaaS, or the requirements of hyperconverged infrastructures, reinforces our need for control. Because a GPU may live on-premise, in a hybrid ecosystem or in a cloud instance, but its impact reaches services, network, processes, APIs, applications…
Hence the need to bring everyone a little closer together so the GPU fits into that family picture, and to include it in global monitoring, not as a stray metric in a tab nobody opens.
And speaking of them…

Key metrics in GPU monitoring

We need to keep an eye on things, fine, but what exactly?
The metrics of any GPU monitoring strategy worthy of the name are, in addition to the number of GPUs and their names:

  • GPU status (its operational state).
  • GPU utilization (as a percentage of usage).
  • GPU memory used / free / total (general memory status, both raw and as a percentage).
  • GPU temperature to know whether the AI warp reactor is about to explode or not.
  • Power consumption and power limit to avoid bills the size of Superman’s cape.
  • Fan speed when this applies, of course, as we will see in a moment.
  • Critical errors.
  • NVIDIA driver and CUDA versions.

A couple of nuances, because they hide the details that will save us from unpleasant surprises.
Critical errors are not a universal metric. On data center cards with ECC support, such as A100, H100, Tesla or RTX PRO, they are relevant, but on consumer-grade cards, such as GeForce RTX or similar models, NVIDIA’s command-line utility nvidia-smi may return N/A for ECC errors.
Therefore, our Pandora FMS plugin, for example, will output 0 by design. That eternal zero does not mean the card is immortal, it simply means there is nothing to count.
On the other hand, fan speed only makes sense on cards with active cooling (D’oh, as Homer Simpson would say). On server GPUs or fanless environments, it would not apply, of course.
That is why a monitoring plugin like ours allows that module to be skipped with --include-fan=false, since advanced telemetry is also about knowing what not to measure, so as not to flood everything with noise.
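The two nuances above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys (`temperature`, `ecc_errors`, `fan_speed`) and the helper names are hypothetical, and the real plugin may be written quite differently. It shows an unsupported ECC reading being reported as 0 and the fan module being left out when it is disabled or not reported.

```python
from typing import Dict

# Strings treated as "no value" in this sketch; nvidia-smi may print
# N/A-style markers for fields a card does not support.
UNSUPPORTED = {"", "N/A", "[N/A]", "[Not Supported]"}


def is_unsupported(raw: str) -> bool:
    """True when the raw reading carries no usable number."""
    return raw.strip() in UNSUPPORTED


def normalize_ecc(raw: str) -> int:
    """Return the ECC error count, or 0 when the card does not report one."""
    return 0 if is_unsupported(raw) else int(raw.strip())


def build_modules(index: int, readings: Dict[str, str],
                  include_fan: bool = True) -> Dict[str, float]:
    """Build a module-name -> value mapping for one GPU (hypothetical input keys)."""
    modules: Dict[str, float] = {
        f"GPU_{index}_Temperature": float(readings["temperature"]),
        f"GPU_{index}_Critical_Errors": normalize_ecc(readings["ecc_errors"]),
    }
    fan = readings.get("fan_speed", "")
    # Assumption of this sketch: the fan module is skipped both when the
    # operator disables it and when the card reports no fan reading.
    if include_fan and not is_unsupported(fan):
        modules[f"GPU_{index}_Fan_Speed"] = float(fan)
    return modules
```

Calling `build_modules(0, readings, include_fan=False)` mirrors the effect of disabling the fan module: the other modules are still produced and nothing is emitted for the fan.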

nvidia-smi, DCGM and monitoring platforms: what each one brings

We have already seen the what, now it is time to see how we monitor and also to differentiate between tools, because they are easily confused.
Let’s analyze what each one brings.
On the one hand, we have nvidia-smi, which has already shown its face: it is the command-line utility that lets you query information from NVIDIA GPUs. We are talking about utilization, memory, temperature, power consumption, processes, and driver and CUDA versions.
This is the GPU tricorder: you point it, read a figure and move on. Perfect for one-off diagnostics or the occasional late-night script.
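As a minimal sketch of that point-and-read usage, the snippet below asks nvidia-smi for a handful of fields in CSV form and turns each line into a dictionary. The choice of fields and the function names `parse_gpu_csv` and `query_gpus` are decisions made for this illustration, not a description of how the Pandora FMS plugin is implemented.

```python
import csv
import io
import subprocess
from typing import Dict, List

# Query fields chosen for this example; `nvidia-smi --help-query-gpu`
# lists everything the installed driver supports.
FIELDS = [
    "index", "name", "utilization.gpu",
    "memory.used", "memory.total",
    "temperature.gpu", "power.draw",
]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if row  # ignore blank lines
    ]


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi once and return the parsed readings (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a host without a GPU, `parse_gpu_csv` can still be exercised with a made-up line such as `"0, Example GPU, 95, 35000, 40960, 71, 310.50"`. The values are kept as strings on purpose, since some fields may come back as N/A on certain cards.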
NVIDIA DCGM (Data Center GPU Manager) is aimed at GPU management in data centers, clusters and advanced deployments. Here we already have broader sensors for our spaceship: it is useful when there are many GPUs or when integration with observability stacks is needed.
Both are good tools, but the truth is that neither nvidia-smi nor DCGM will solve the entire IT operation for us.
They do not centralize the infrastructure, they do not replace corporate dashboards, they do not provide business-oriented historical reporting and they do not correlate GPU with CPU, RAM, disk, network, logs, services or SLA on their own, for example.
For these reasons, proposals such as Datadog insist on connecting the GPU with the rest of the AI workload.
Or, to put it another way: nvidia-smi and DCGM can be excellent data sources, but on their own they float in a vacuum.
It is a platform such as Pandora FMS that turns them into operational, actionable knowledge with history, alerts, dashboards, reports, correlation and capacity decisions.
The tricorder gives you a reading; an Enterprise command bridge like Pandora FMS tells you what it means in the context of everything else when it integrates that reading.
And in a crisis, you want to be on the bridge, not decoding loose data in the dark.

The risks of not monitoring GPUs

Leaving the engine unattended has disastrous consequences, such as:

  • Invisible saturation, the kind you only discover when it already hurts.
  • Service degradation, affecting inference and causing slower or failed training jobs.
  • CUDA errors that go unnoticed.
  • Overheating and electricity bills that stretch for miles.
  • Unnecessary purchases of hardware or underutilization of what has already been paid for.
  • Difficulties in relating application incidents to the real pressure on the infrastructure.

It is the plot of so many disaster movies: the signs were there, blinking on some panel, but nobody was looking at them.

From isolated metrics to operational correlation

Isolated GPU monitoring is a thermometer. Integrated into global monitoring, with the operational correlation that integration brings, it becomes diagnosis and operational knowledge.
Knowing that a GPU is at 95% does not say much, just as fever alone does not reveal its cause. What matters is context.
Following the example, a GPU at 95% for ten minutes may be normal, but the same GPU at 95% for four hours, with memory above 85%, errors appearing in the logs and the latency of an inference service rising, is an operational incident.
Another example is temperature. A high reading together with a fan running at warp 9 and performance through the floor may anticipate a physical failure, a breach in the wall that will bring down the castle if ignored.
Connecting those signals is what distinguishes a tool that looks from a platform that understands.
That is the substance of root cause analysis, the daily work of any Network Operations Center or SRE team that takes IT system monitoring seriously.
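The four-hour example above can be written down as a simple rule. The sketch below is purely illustrative: the `GpuWindow` fields, the labels and the way signals are combined are assumptions made for this example, not how Pandora FMS evaluates correlation internally. It only shows that the same 95% reading leads to different conclusions depending on the signals around it.

```python
from dataclasses import dataclass


@dataclass
class GpuWindow:
    """Signals observed over one time window (hypothetical structure)."""
    minutes_at_95: int            # minutes with utilization at or above 95%
    memory_used_percent: float    # GPU memory in use
    log_errors: int               # GPU-related errors found in the logs
    latency_change_ms: float      # change in inference latency over the window


def assess(window: GpuWindow) -> str:
    """Combine the signals instead of judging utilization on its own."""
    sustained = window.minutes_at_95 >= 240        # four hours, as in the example
    memory_pressure = window.memory_used_percent > 85
    errors_present = window.log_errors > 0
    latency_rising = window.latency_change_ms > 0

    if sustained and memory_pressure and errors_present and latency_rising:
        return "incident"
    if window.minutes_at_95 > 0:
        return "busy"   # high load with no supporting evidence of trouble
    return "normal"
```

With these definitions, `assess(GpuWindow(10, 60.0, 0, 0.0))` returns `"busy"`, while `assess(GpuWindow(240, 90.0, 3, 35.0))` returns `"incident"`.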

How Pandora FMS helps with GPU monitoring

In technology, you are either at the cutting edge or you are not really there.
That is the philosophy at Pandora FMS, and that is why the application allows GPU metrics to be integrated into the same console where servers, services, networks, storage, logs, events and availability are already monitored.
This means GPU monitoring is not a footnote, but a key part of machinery optimized for the monitoring and observability that a professional IT infrastructure needs.
To achieve this, Pandora FMS uses a specialist GPU monitoring plugin.
It relies on nvidia-smi and works as a local agent plugin on the host with an NVIDIA GPU, issuing modules that Pandora FMS incorporates into the agent context.
Thus, these metrics can:

  • Be visualized, obviously.
  • Build a history for trends, analysis or whatever is needed.
  • Trigger intelligent alerts.
  • Be included in reports like any other metric.

In addition, the information it issues is critical and necessary for the optimal management of an asset more expensive than adamantium.
For each detected GPU, it issues, among others:

  • GPU_<i>_Status (1 = healthy, 0 = degraded).
  • GPU_<i>_Utilization.
  • GPU_<i>_Memory_Used, GPU_<i>_Memory_Free and GPU_<i>_Memory_Total (in MiB, mebibytes, not bytes).
  • GPU_<i>_Memory_Used_Percent.
  • GPU_<i>_Temperature (°C).
  • GPU_<i>_Power_Draw and GPU_<i>_Power_Limit (W).
  • GPU_<i>_Fan_Speed (optional).
  • GPU_<i>_Critical_Errors.
  • GPU_<i>_Name.

At host scale, it adds three global ones: GPU_Count, GPU_Driver_Version and GPU_CUDA_Version.
The total count comes from a simple formula: 12N + 3 modules with fan speed, or 11N + 3 without it, where N is the number of GPUs on the host.
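As a quick worked check of that formula, the helper below (a hypothetical function written for this article, not part of the plugin) computes the expected number of modules for a host.

```python
def module_count(gpus: int, include_fan: bool = True) -> int:
    """Expected module total: 12 per GPU (11 without fan speed) plus 3 host-level ones."""
    per_gpu = 12 if include_fan else 11
    return per_gpu * gpus + 3


# A host with 4 GPUs: 12 * 4 + 3 = 51 modules, or 11 * 4 + 3 = 47 without fan speed.
print(module_count(4))                     # 51
print(module_count(4, include_fan=False))  # 47
```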
And one detail reveals a lot about the design. If nvidia-smi is not available, no need to worry, the plugin does not break or produce invalid XML; instead, it issues GPU_Count = 0 in critical status.
The issue remains visible as a clean alert, not as that suspicious silence that may be brewing disaster.
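A minimal sketch of that fail-safe idea, assuming a module XML layout with name, type and data elements inside a module element. The function names, the injectable `which` parameter and the error handling are illustrative choices, not the plugin's actual code. How the critical status itself is raised is not reproduced here; in this sketch it is left to the thresholds configured for the module.

```python
import shutil
from typing import Callable, Optional
from xml.dom import minidom


def module_xml(name: str, value: str, module_type: str = "generic_data") -> str:
    """Render one module block (layout assumed for this sketch)."""
    return (
        "<module>\n"
        f"<name><![CDATA[{name}]]></name>\n"
        f"<type><![CDATA[{module_type}]]></type>\n"
        f"<data><![CDATA[{value}]]></data>\n"
        "</module>"
    )


def gpu_count_module(
    count_gpus: Callable[[], int],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Always return a well-formed GPU_Count module, even without nvidia-smi."""
    if which("nvidia-smi") is None:
        # Data source missing: report zero GPUs instead of failing silently.
        return module_xml("GPU_Count", "0")
    try:
        return module_xml("GPU_Count", str(count_gpus()))
    except Exception:
        # Any failure while counting also degrades to a visible zero.
        return module_xml("GPU_Count", "0")


def is_well_formed(xml_text: str) -> bool:
    """Check that the output parses as XML."""
    try:
        minidom.parseString(xml_text)
        return True
    except Exception:
        return False
```

The design point is that every code path ends in a valid module, so a missing data source shows up as a value that can be alerted on, not as an agent that quietly reports nothing.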
Our goal, because we have experienced firsthand the costly chaos of fragmentation, is for Pandora FMS to let all these signals be analyzed on the same platform, with integrated alerts or our metaconsole, for example. The result is a single point of visibility and control, instead of juggling commands, logs and scattered dashboards.

Thresholds, alerts and critical statuses

Pandora FMS is here to prevent disaster. To do so, it lifts every last rug and connects the dots to give you something as close as possible to Atreides prescience in Dune. Hence its threshold and alert system.
And so that we can get started right away without building everything from scratch, a situation that often leads us to procrastinate on what matters, the plugin includes predefined thresholds for the most sensitive metrics. They are disjoint (they do not overlap) and bounded, with a clear floor and ceiling:

  • GPU_<i>_Memory_Used_Percent — Normal: 0-69%. Warning: 70-84%. Critical: 85-100%.
  • GPU_<i>_Temperature — Normal: 0-69 °C. Warning: 70-89 °C. Critical: 90-110 °C.
  • GPU_<i>_Critical_Errors — Normal: 0. Critical: any value greater than 0.

Let’s remember that the critical errors module makes sense on data center cards with ECC, where any uncorrected error must be treated as a priority, with no middle ground. On consumer GPUs, it does not apply.
With these thresholds, we will detect memory pressure, thermal risk and critical errors without falling into alert overload, that noise that eventually makes everyone ignore everything.
And if an organization needs different values, great. We believe every organization is its own world, so they can be adjusted to our specific reality from the console itself.
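To make the default ranges concrete, here is a small sketch that classifies a reading with the floors listed above. The function and dictionary names are hypothetical, and the sketch makes two simplifying assumptions: readings are compared against the lower bound of each range, and the ceilings (100% and 110 °C) are not modeled.

```python
# Lower bounds of the warning and critical ranges, taken from the list above.
THRESHOLDS = {
    "Memory_Used_Percent": (70, 85),
    "Temperature": (70, 90),
}


def classify(metric: str, value: float) -> str:
    """Return 'normal', 'warning' or 'critical' for one reading."""
    if metric == "Critical_Errors":
        # No warning band: any value greater than 0 is critical.
        return "critical" if value > 0 else "normal"
    warning_from, critical_from = THRESHOLDS[metric]
    if value >= critical_from:
        return "critical"
    if value >= warning_from:
        return "warning"
    return "normal"
```

For example, `classify("Temperature", 72)` gives `"warning"` and `classify("Memory_Used_Percent", 85)` gives `"critical"`. An organization with different values would only need to change the numbers in `THRESHOLDS`.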

Compatibility and requirements of the Pandora FMS plugin

“Promises that mean nothing…”, as the song says. At Pandora FMS, we are not fond of them, so to avoid making claims that later do not hold up, it is worth being clear about what the plugin does and does not do:

  • It is aimed at NVIDIA GPUs and uses nvidia-smi as its data source.
  • It works as a local agent plugin and requires the NVIDIA driver to be installed and nvidia-smi to be available.
  • It is compatible with Linux and Windows.
  • In the cloud, it works with GPU instances from AWS, Azure or Google Cloud as long as the NVIDIA GPU is exposed to the operating system and the driver is properly installed. Providers document this point, as shown in the guides from Google Cloud or AWS.
  • At the moment, it does not support AMD or Intel.
  • Nor does it monitor AI models, prompts, drift or MLOps metrics.
  • The recommended agent interval is at least 30 seconds, although the default value of 300 seconds is suitable for most environments.

And one operational nuance. The plugin is designed for individual hosts or medium-sized environments.
If your case involves clusters with many GPUs per node or large-scale Kubernetes architectures, it may need to be complemented with approaches such as DCGM, Prometheus or other aggregation solutions.
No tool does everything, and acknowledging that seems honest to us.

GPU monitoring and capacity planning

Considering the number of firstborns we must sacrifice to the Machine God to afford a GPU, everything above is more than justified. Either we monitor, or the most critical point becomes a blind spot.
But it is in capacity planning where monitoring translates into hard cash.
With historical data on utilization, memory, temperature and power consumption, it is possible to detect GPUs that are recurrently saturated, identify underused ones, redistribute workloads and justify purchases to the procurement department with data, instead of hunches and pleas, since we will be able to measure the real growth in demand.
In the end, it is data science applied to IT: turning the past into decisions.
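A rough sketch of what that analysis could look like over stored utilization samples. Everything here is an assumption made for illustration: the cut-off values (90% for saturated, 20% for idle, half of the samples for "recurrently"), the labels and the function name. In practice the history would come from the monitoring platform and not from a hand-written dictionary.

```python
from statistics import mean
from typing import Dict, List


def capacity_report(
    history: Dict[str, List[float]],
    saturated_at: float = 90.0,   # utilization % counted as saturated (assumed)
    idle_below: float = 20.0,     # average utilization % counted as underused (assumed)
    recurrent_share: float = 0.5, # fraction of samples that makes saturation "recurrent"
) -> Dict[str, str]:
    """Label each GPU from its historical utilization samples."""
    report: Dict[str, str] = {}
    for gpu, samples in history.items():
        if not samples:
            continue  # nothing to say without data
        saturated_share = sum(s >= saturated_at for s in samples) / len(samples)
        if saturated_share >= recurrent_share:
            report[gpu] = "recurrently saturated"
        elif mean(samples) < idle_below:
            report[gpu] = "underused"
        else:
            report[gpu] = "ok"
    return report
```

Fed with invented samples such as `{"GPU_0": [95, 97, 92, 60], "GPU_1": [5, 10, 8, 12]}`, the function labels the first GPU as recurrently saturated and the second as underused, which is the kind of evidence that supports redistributing workloads before buying more hardware.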

GPU monitoring as part of AI infrastructure monitoring

People usually talk about GPUs because, these days, they have climbed to the top of many IT infrastructures, but they do not play the game alone.
An AI infrastructure also depends on CPU, RAM, disk, storage, network, APIs, services, logs, processes, containers, databases and overall availability.
That is why GPU monitoring must be part of a broader AI infrastructure monitoring strategy.
Watching only the GPU is like piloting the Enterprise while paying attention only to the antimatter engine and ignoring shields, life support or even the course.
That is where Pandora FMS becomes that legendary ship computer that knew everything.
Pandora integrates that visibility into a complete IT operation, where AI applied to management, generative AI and deep learning provide the analysis that, let us admit it as a species, no human could sustain manually, nor would they want to, because there is no need.
This is especially relevant in data centers, where availability is taken for granted until it is no longer there, as any good uptime monitoring strategy knows well.

Conclusion

We have covered a lot, but let’s focus on three ideas to take home.

  • That GPUs are critical resources in AI and HPC infrastructures, not a technical detail.
  • That monitoring them in isolation is not enough, since a figure without context is noise.
  • That the value lies in integrating them into the global monitoring of the infrastructure, where a GPU metric talks to servers, services, networks, logs and alerts.

NVIDIA tools provide the reading, but a command bridge is needed to interpret it.
Pandora FMS aims to be that bridge to bring GPU monitoring to on-premise, hybrid and enterprise environments. Because in the gaps of IT live the gremlins of the machine, the ones that cause trouble precisely where you are not watching.

Frequently asked questions

Let’s compile the most important questions about GPU monitoring and their answers.

What is GPU monitoring?

GPU monitoring is the continuous supervision of the status, utilization, memory, temperature, power consumption and errors of Graphics Processing Units in professional environments.
It is used to control GPUs used in AI, HPC, inference, model training or intensive processing.
As I mentioned at the beginning, it should not be confused with gaming, overclocking or graphic tuning tools.

Why is it important to monitor GPUs in AI infrastructures?

Because GPUs are what keep the heart beating and, moreover, they are very expensive and critical resources for inference, training and machine learning workloads.
Without monitoring, we are blind in our most critical asset, leaving it defenseless against saturation, overheating, errors or underutilization, which have a habit of attacking from the shadows and gaps we leave unwatched.
This increases operational risk and makes it harder to get new hardware investments approved by the finance department.

Which metrics should be monitored on a GPU?

The main ones are utilization, used, free and total memory, percentage of memory used, temperature, instant power consumption, power limit, fan speed, status, critical ECC errors, GPU model, driver version and CUDA version.
Quite a list, but in isolation these metrics give us an unfinished puzzle, so in professional environments they must be analyzed together with the rest of the infrastructure.

What is the difference between nvidia-smi, DCGM and a monitoring platform?

nvidia-smi is a command-line utility for querying specific metrics from NVIDIA GPUs.
DCGM is aimed at GPU management and monitoring in data centers and clusters.
A platform such as Pandora FMS centralizes those metrics together with the rest of the infrastructure, with history, alerts, dashboards and reports.

At what temperature is a server GPU considered critical?

Based on the experience built into the Pandora FMS GPU monitoring plugin, a temperature of 70 °C or more is considered a warning and 90 °C or more, critical.
These thresholds serve as an operational reference, although they may vary depending on the GPU model, manufacturer and thermal conditions of the environment.

What are ECC errors in a GPU and why do they matter?

ECC (Error-Correcting Code) errors are memory errors detected in GPUs with ECC support.
Uncorrected errors may indicate hardware failures and must be treated as critical incidents.
They are especially relevant in data center GPUs such as A100, H100, Tesla or RTX PRO. In consumer-grade GPUs, the module is still issued, but with a value of 0 in the Pandora FMS plugin, because there are no ECC errors to count.

Can GPUs be monitored in on-premise environments with Pandora FMS?

Yes. Pandora FMS has a local agent plugin based on nvidia-smi that allows NVIDIA GPUs to be monitored on on-premise servers.
It can also be used in hybrid environments or cloud instances if the NVIDIA GPU is exposed to the operating system and the driver is correctly installed.

How does GPU monitoring help with capacity planning?

Historical utilization, memory, temperature and power consumption data makes it possible to identify GPUs that are recurrently saturated, detect underutilization and redistribute workloads.
It also helps justify hardware purchases or postpone unnecessary expansions with objective usage and capacity data in hand.

What happens if nvidia-smi is not installed on the server?

If nvidia-smi is not available, the Pandora FMS plugin issues the GPU_Count module with a value of 0 in critical status.
This allows an alert to be generated without breaking the agent execution or producing invalid XML.
The issue remains visible as the absence of GPU or a failure in the availability of the data source.

Is GPU monitoring enough to manage an AI infrastructure?

No. It is part of a broader whole.
The GPU is a critical component, but an AI infrastructure also depends on CPU, RAM, storage, network, services…
That is why GPU monitoring must be integrated into a broader AI infrastructure monitoring strategy. Without that, it will be impossible to have a complete operational view.
