{"id":424589,"date":"2026-06-11T13:34:43","date_gmt":"2026-06-11T13:34:43","guid":{"rendered":"https:\/\/pandorafms.com\/?p=424589"},"modified":"2026-06-12T10:15:55","modified_gmt":"2026-06-12T10:15:55","slug":"gpu-monitoring","status":"publish","type":"post","link":"https:\/\/pandorafms.com\/en\/it-topics\/gpu-monitoring\/","title":{"rendered":"GPU monitoring: GPU monitoring for AI and hybrid environments"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; admin_label=&#8221;Section&#8221; _builder_version=&#8221;4.22.0&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;0px||0px||false|false&#8221; custom_padding=&#8221;0px||0px||false|false&#8221; locked=&#8221;off&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_row column_structure=&#8221;1_4,3_4&#8243; _builder_version=&#8221;4.27.0&#8243; _module_preset=&#8221;default&#8221; custom_padding=&#8221;50px||||false|false&#8221; custom_css_main_element=&#8221;z-index:0!important;&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;1_4&#8243; disabled_on=&#8221;on|on|off&#8221; _builder_version=&#8221;4.22.0&#8243; _module_preset=&#8221;default&#8221; custom_padding=&#8221;||||false|false&#8221; sticky_position=&#8221;top&#8221; sticky_offset_top=&#8221;100px&#8221; sticky_limit_bottom=&#8221;section&#8221; motion_trigger_start=&#8221;top&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_text admin_label=&#8221;indice&#8221; _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;||0px||false|false&#8221; custom_padding=&#8221;||14px||false|false&#8221; link_option_url=&#8221;#1&#8243; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p style=\"font-size: 0.9em; line-height: 1.4em; color: #333333;\"><strong>Sections<\/strong><\/p>\n<ul class=\"ittopicsul\">\n<li><a href=\"#1\">What GPU monitoring is<\/a><\/li>\n<li><a href=\"#2\">Why GPU monitoring matters<\/a><\/li>\n<li><a href=\"#3\">On-premise, hybrid and enterprise 
environments<\/a><\/li>\n<li><a href=\"#4\">Key metrics in GPU monitoring<\/a><\/li>\n<li><a href=\"#5\">nvidia-smi, DCGM and monitoring platforms<\/a><\/li>\n<li><a href=\"#6\">The risks of not monitoring GPUs<\/a><\/li>\n<li><a href=\"#7\">From isolated metrics to operational correlation<\/a><\/li>\n<li><a href=\"#8\">How Pandora FMS helps with GPU monitoring<\/a><\/li>\n<li><a href=\"#9\">Thresholds, alerts and critical statuses<\/a><\/li>\n<li><a href=\"#10\">Compatibility and requirements of the Pandora FMS plugin<\/a><\/li>\n<li><a href=\"#11\">GPU monitoring and capacity planning<\/a><\/li>\n<li><a href=\"#12\">Part of AI infrastructure monitoring<\/a><\/li>\n<li><a href=\"#13\">Conclusion<\/a><\/li>\n<li><a href=\"#14\">Frequently asked questions<\/a><\/li>\n<\/ul>\n<p>[\/et_pb_text][\/et_pb_column][et_pb_column type=&#8221;3_4&#8243; _builder_version=&#8221;4.27.0&#8243; _module_preset=&#8221;default&#8221; custom_css_main_element=&#8221;z-index:0!important;&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_text admin_label=&#8221;seccion&#8221; module_id=&#8221;1&#8243; module_class=&#8221;ittopicscontent&#8221; _builder_version=&#8221;4.27.6&#8243; _module_preset=&#8221;default&#8221; z_index=&#8221;0&#8243; custom_margin=&#8221;0px||0px||true|false&#8221; custom_padding=&#8221;0px||0px||false|false&#8221; hover_enabled=&#8221;0&#8243; custom_css_main_element=&#8221;font-family:%22Pandora-Light%22;&#8221; locked=&#8221;off&#8221; global_colors_info=&#8221;{}&#8221; sticky_enabled=&#8221;0&#8243;]In the good old days, as some would say, when you did not need a mortgage to buy one, the GPU (Graphics Processing Unit) was that component that only mattered to those who wanted to play Crysis at full resolution. 
<strong>Today, it is the antimatter reactor powering half the industry<\/strong>: AI model training, inference, machine learning, simulation, rendering, data analysis and HPC (High-Performance Computing). <strong>Without GPUs there is no AI<\/strong>, and without AI, according to the visionaries (who sell it), there is no future. One that, according to those same people, may end up destroyed anyway by that very artificial intelligence.<br \/>\nBut in many cases, <strong>the diamond heart of the infrastructure remains unsupervised<\/strong>.<br \/>\nServers, CPUs, networks and services are monitored with hawk-like intensity, but <strong>GPUs remain the most expensive blind spot<\/strong>, one that, if left unattended, can turn into a massive financial drain.<br \/>\nThat is why the real challenge is not just measuring the GPU, but <strong>integrating it into the monitoring of on-premise, hybrid and cloud infrastructures<\/strong>.<br \/>\nHere we will see how.<\/p>\n<h2 id=\"1\">What is GPU monitoring?<\/h2>\n<p>GPU monitoring is the continuous supervision of the health, utilization, memory, temperature, power consumption, availability, and errors of your Graphics Processing Units. Described like that, it sounds like the control panel of a starship&#8230; and we aren&#8217;t far off.<br \/>\nIt is obvious given the introduction, but to be clear, we are not talking about overclocking or squeezing out extra frames here. 
We are talking about monitoring critical resources used by AI workloads, HPC, or intensive processing within a professional <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/infrastructure-monitoring\/\" target=\"_blank\" rel=\"noopener\">infrastructure monitoring<\/a> strategy.<\/p>\n<h2 id=\"2\">Why GPU monitoring matters in AI infrastructures<\/h2>\n<p>GPUs are used to accelerate an organization\u2019s critical operations, but since nothing in life comes for free, they also introduce new operational risks. This makes them both the engine of any AI project and its Achilles\u2019 heel.<br \/>\nWhy?<\/p>\n<ul class=\"lista\">\n<li><strong>Because they are increasingly expensive and limited resources<\/strong>. Every idle or saturated card, just like every minute of downtime, is capital burning away.<\/li>\n<li><strong>Saturation affects inference and training<\/strong>. A GPU throttling at its limit slows down responses and extends jobs that already take long enough as it is.<\/li>\n<li><strong>Lack of memory causes errors or degradation<\/strong>. 
An out-of-memory error in the middle of training is the blue screen of death of our time, only much more expensive.<\/li>\n<li><strong>Temperature impacts performance and availability<\/strong>, because, just like with humans, heat degrades, and too much heat shuts things down.<\/li>\n<li><strong>Underutilization makes it harder to justify the investment<\/strong> and forces explanations when management looks at an expense column that stretches out for miles.<\/li>\n<li><strong>Historical data helps make capacity decisions<\/strong>, but a lack of visibility complicates operations across hybrid, on-premise and cloud environments.<\/li>\n<\/ul>\n<p>The conclusion is that leaving the GPU out of the monitoring family picture is madness in these times we are living through.<\/p>\n<h2 id=\"3\">The real gap: GPU monitoring for on-premise, hybrid and enterprise environments<\/h2>\n<p>There is a widespread belief that AI lives in the cloud, in a data center managed by someone else and devastating an ecosystem far removed from our own. 
However, many organizations operate GPUs in their own data centers, in hybrid environments, research centers, industrial settings, cloud instances with GPU&#8230;<br \/>\nAnd many of these organizations <strong>operate under confidentiality, data sovereignty or data legislation requirements<\/strong> where a 100% cloud solution is not viable, under penalty of fines or the risk of dangerous leaks.<br \/>\nIn these cases, it is not enough to install Ollama or similar tools and forget about it, nor is the current documentation from a cloud provider enough to operate with confidence and efficiency.<br \/>\nWe need a platform capable of <strong>watching over everything, GPUs and the rest of the infrastructure<\/strong> as a whole, whether on <a href=\"https:\/\/pandorafms.com\/en\/server-monitoring\/\" target=\"_blank\" rel=\"noopener\">bare-metal servers<\/a> or <a href=\"https:\/\/pandorafms.com\/en\/virtual-monitoring\/\" target=\"_blank\" rel=\"noopener\">virtual environments<\/A>.<br \/>\nThe never-ending debate between <a href=\"https:\/\/pandorafms.com\/blog\/on-premise-vs-saas-2025\/\" target=\"_blank\" rel=\"noopener\">on-premise and SaaS<\/a>, or the requirements of <a href=\"https:\/\/pandorafms.com\/blog\/hyperconverged-infrastructure-monitoring\/\" target=\"_blank\" rel=\"noopener\">hyperconverged infrastructures<\/a>, reinforces our need for control. 
A GPU may live on-premise, in a hybrid ecosystem or in a cloud instance, but <strong>its impact affects services, network, processes, APIs, applications<\/strong>&#8230;<br \/>\nHence the need to get everyone a little closer together so the GPU fits into that family picture and is <strong>included in global monitoring<\/strong>, not left as a stray metric in a tab nobody opens.<br \/>\nAnd speaking of metrics&#8230;<\/p>\n<h2 id=\"4\">Key metrics in GPU monitoring<\/h2>\n<p>We need to keep an eye on things, fine, but what exactly?<br \/>\n<strong>The metrics of any GPU monitoring strategy<\/strong> worthy of the name are, in addition to the number of GPUs and their names:<\/p>\n<ul class=\"lista\">\n<li><strong>GPU status<\/strong> (its operational state).<\/li>\n<li><strong>GPU utilization<\/strong> (as a percentage of usage).<\/li>\n<li><strong>GPU memory used \/ free \/ total<\/strong> (general memory status, both raw and as a percentage).<\/li>\n<li><strong>GPU temperature<\/strong> to know whether the AI warp reactor is about to explode or not.<\/li>\n<li><strong>Power consumption<\/strong> and power limit to avoid utility bills the size of Superman\u2019s cape.<\/li>\n<li><strong>Fan speed<\/strong> when this applies, of course, as we will see in a moment.<\/li>\n<li><strong>Critical errors<\/strong>.<\/li>\n<li><strong>NVIDIA driver and CUDA versions<\/strong>.<\/li>\n<\/ul>\n<p>A couple of nuances, because they hide the details that will save us from unpleasant surprises.<br \/>\n<strong>Critical errors are not a universal metric<\/strong>. On data center cards with ECC support, such as A100, H100, Tesla or RTX PRO, they are relevant, but on consumer-grade cards, such as GeForce RTX or similar models, NVIDIA\u2019s command-line utility <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/span> may return N\/A for ECC errors.<br \/>\nTherefore, our Pandora FMS plugin, for example, will output 0 by design. 
That eternal zero does not mean the card is immortal; it simply means there is nothing to count.<br \/>\nOn the other hand, fan speed <strong>only makes sense on cards with active cooling<\/strong> (D\u2019oh, as Homer Simpson would say). On passively cooled server GPUs or in fanless environments, it does not apply, of course.<br \/>\nThat is why a monitoring plugin like ours allows that module to be skipped with <span style=\"color:green; font-family:Pandocode;\">--include-fan=false<\/span>, since <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/telemetry-management-infrastructures-pandora-fms\/\" target=\"_blank\" rel=\"noopener\">advanced telemetry<\/a> is also about knowing what not to measure, so as not to flood everything with noise.<\/p>\n<h2 id=\"5\">nvidia-smi, DCGM and monitoring platforms: what each one brings<\/h2>\n<p>We have already seen the what, <strong>now it is time to see how we monitor<\/strong> and also to differentiate between tools, because they are easily confused.<br \/>\nLet\u2019s analyze what each one brings.<br \/>\nOn the one hand, we have <span style=\"font-family:Pandocode;\"><a href=\"https:\/\/docs.nvidia.com\/deploy\/nvidia-smi\/\" target=\"_blank\" rel=\"noopener nofollow\">nvidia-smi<\/A><\/span>, which has already shown its face and is <strong>the command-line utility that allows you to query information from NVIDIA GPUs<\/strong>. We are talking about utilization, memory, temperature, power consumption, processes, driver, <a href=\"https:\/\/docs.nvidia.com\/cuda\/\" target=\"_blank\" rel=\"noopener nofollow\">CUDA<\/A>\u2026<br \/>\nThis is the GPU tricorder: you point it, read a figure and move on. <strong>Perfect for one-off diagnostics or the occasional late-night script<\/strong>.<br \/>\n<a href=\"https:\/\/developer.nvidia.com\/dcgm\" target=\"_blank\" rel=\"noopener nofollow\">NVIDIA DCGM<\/a> is aimed at <strong>GPU management in data centers, clusters and advanced deployments<\/strong>. 
Here we already have broader sensors for our spaceship: it is useful when there are many GPUs or when integration with observability stacks is needed.<br \/>\nBoth are good tools, but the truth is that <strong>neither <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/span> nor DCGM will solve the entire IT operation for us<\/strong>.<br \/>\nOn their own, they do not centralize the infrastructure, replace corporate dashboards, provide business-oriented historical reporting or correlate GPU with CPU, RAM, disk, network, logs, services or SLA, for example.<br \/>\nFor these reasons, proposals such as <a href=\"https:\/\/www.datadoghq.com\/product\/gpu-monitoring\/\" target=\"_blank\" rel=\"noopener nofollow\">Datadog<\/a> insist on connecting the GPU with the rest of the AI workload.<br \/>\nOr let me put it another way to make it clear. <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/span> <strong>and DCGM can be excellent data sources, but on their own they float in a vacuum<\/strong>.<br \/>\nIt is a platform such as <strong>Pandora FMS that turns them into operational, actionable knowledge<\/strong> with history, alerts, dashboards, reports, correlation and capacity decisions.<br \/>\nThe tricorder gives you a reading; an Enterprise command bridge like Pandora FMS integrates that reading and tells you what it means in the context of everything else.<br \/>\nAnd in a crisis, you want to be on the bridge, not decoding loose data in the dark.<\/p>\n<h2 id=\"6\">The risks of not monitoring GPUs<\/h2>\n<p>Leaving the engine unattended has disastrous consequences, such as:<\/p>\n<ul class=\"lista\">\n<li><strong>Invisible saturation<\/strong>, the kind you only discover when it already hurts.<\/li>\n<li><strong>Service degradation<\/strong>, affecting inference and causing slower or failed training jobs.<\/li>\n<li><strong>CUDA errors<\/strong> that go unnoticed.<\/li>\n<li><strong>Overheating and electricity 
bills<\/strong> that stretch for miles.<\/li>\n<li><strong>Unnecessary purchases<\/strong> of hardware or <strong>underutilization<\/strong> of what has already been paid for.<\/li>\n<li><strong>Difficulties in relating application incidents<\/strong> to the real pressure on the infrastructure.<\/li>\n<\/ul>\n<p>It is the plot of so many disaster movies: the signs were there, blinking on some panel, but nobody was looking at them.<\/p>\n<h2 id=\"7\">From isolated metrics to operational correlation<\/h2>\n<p>Isolated GPU monitoring is a thermometer that, with the <strong>operational correlation derived from integrating it into global monitoring<\/strong>, becomes diagnosis and operational knowledge.<br \/>\nKnowing that a GPU is at 95% does not say much, just as fever alone does not reveal its cause. <strong>What matters is context<\/strong>.<br \/>\nFollowing the example, a GPU at 95% for ten minutes may be normal, but the same GPU at 95% for four hours, with memory above 85%, errors appearing in the logs and the latency of an inference service rising, is an <strong>operational incident<\/strong>.<br \/>\nAnother example is temperature. 
A high reading together with a fan running at warp 9 and performance through the floor may anticipate a physical failure, a breach in the wall that will bring down the castle if ignored.<br \/>\n<strong>Connecting those signals is what distinguishes a tool that looks from a platform that understands.<\/strong><br \/>\nThat is the substance of <a href=\"https:\/\/pandorafms.com\/blog\/root-cause-analysis\/\" target=\"_blank\" rel=\"noopener\">root cause analysis<\/a>, the daily work of any <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/network-operations-center\/\" target=\"_blank\" rel=\"noopener\">Network Operations Center<\/a> or <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/sre\/\" target=\"_blank\" rel=\"noopener\">SRE<\/a> team that takes <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/it-system-monitoring\/\" target=\"_blank\" rel=\"noopener\">IT system monitoring<\/a> seriously.<\/p>\n<h2 id=\"8\">How Pandora FMS helps with GPU monitoring<\/h2>\n<p>In technology, you are either at the cutting edge or you are not really there.<br \/>\nThat is the philosophy at <a href=\"https:\/\/pandorafms.com\/en\/\" target=\"_blank\" rel=\"noopener\">Pandora FMS<\/a>, and that is why the application <strong>allows GPU metrics to be integrated<\/strong> into the same console where servers, services, networks, storage, logs, events and availability are already monitored.<br \/>\nThis means GPU monitoring is not a footnote, but a key part of machinery <strong>optimized for the monitoring and observability that a professional IT infrastructure needs<\/strong>.<br \/>\nTo achieve this, Pandora FMS uses <strong>a specialist GPU monitoring plugin<\/strong>.<br \/>\nIt relies on <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/Span> and works as a local agent plugin on the host with an NVIDIA GPU, issuing modules that Pandora FMS incorporates into the agent context.<br \/>\nThus, these metrics can:<\/p>\n<ul class=\"lista\">\n<li>Be visualized, 
obviously.<\/li>\n<li>Build a history for trends, analysis or whatever is needed.<\/li>\n<li>Trigger intelligent alerts.<\/li>\n<li>Be included in reports like any other metric.<\/li>\n<\/ul>\n<p>In addition, the information it issues is critical and necessary for the optimal management of an asset more expensive than adamantium.<br \/>\nFor each detected GPU, it issues, among others: <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Status<\/span> (1 = healthy, 0 = degraded), <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Utilization<\/span>, <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Memory_Used<\/span>, <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Memory_Free<\/span>, <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Memory_Total<\/Span> (in MiB, mebibytes, not bytes), <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Memory_Used_Percent<\/Span>, <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Temperature<\/Span> (\u00b0C), <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Power_Draw<\/span> and <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Power_Limit<\/span> (W), <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Fan_Speed<\/Span> (optional), <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Critical_Errors<\/span> and <span style=\"color:green; font-family:Pandocode;\">GPU_&lt;i>_Name<\/Span>.<br \/>\nAt host scale, it adds three global ones: <span style=\"color:green; font-family:Pandocode;\">GPU_Count<\/span>, <span style=\"color:green; font-family:Pandocode;\">GPU_Driver_Version<\/span> and <span style=\"color:green; font-family:Pandocode;\">GPU_CUDA_Version<\/span>.<br \/>\nThe total count comes from a simple formula: <span style=\"color:green; font-family:Pandocode;\"><strong>12N + 3<\/strong><\/span> modules with fan speed, or <span style=\"color:green; font-family:Pandocode;\"><strong>11N + 
3<\/strong><\/span> without it, where N is the number of GPUs on the host.<br \/>\nAnd one detail reveals a lot about the design. <strong>If <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/Span> is not available, there is no need to worry: the plugin neither breaks nor produces invalid XML<\/strong>; instead, it issues <span style=\"color:green; font-family:Pandocode;\">GPU_Count = 0<\/span> in critical status.<br \/>\nThe issue remains visible as a clean alert, not as that suspicious silence that may be brewing disaster.<br \/>\nOur goal, because we have experienced firsthand the costly chaos of fragmentation, is that <strong>Pandora FMS allows all these signals to be analyzed on the same platform<\/strong>, with integrated alerts or our metaconsole, for example, to have a single point of visibility and control, instead of juggling between commands, logs and scattered dashboards.<\/p>\n<h2 id=\"9\">Thresholds, alerts and critical statuses<\/h2>\n<p>Pandora FMS is here to prevent disaster. To do so, it lifts every last rug and connects the dots to give you something as close as possible to Atreides prescience in Dune. Hence its threshold and alert system.<br \/>\nAnd so that we can get started right away without building everything from scratch, a situation that often leads us to procrastinate on what matters, the plugin includes predefined thresholds for the most sensitive metrics, with disjoint ranges that do not overlap and are bounded by a clear floor and ceiling:<\/p>\n<ul class=\"lista\">\n<li><span style=\"color:green; font-family:Pandocode;\"><strong>GPU_&lt;i>_Memory_Used_Percent<\/Strong><\/span> \u2014 Normal: 0-69%. Warning: 70-84%. Critical: 85-100%.<\/li>\n<li><span style=\"color:green; font-family:Pandocode;\"><strong>GPU_&lt;i>_Temperature<\/Strong><\/span> \u2014 Normal: 0-69 \u00b0C. Warning: 70-89 \u00b0C. 
Critical: 90-110 \u00b0C.<\/li>\n<li><span style=\"color:green; font-family:Pandocode;\"><strong>GPU_&lt;i>_Critical_Errors<\/Strong><\/span> \u2014 Normal: 0. Critical: any value greater than 0.<\/li>\n<\/ul>\n<p>Let\u2019s remember that the critical-error threshold makes sense on data center cards with ECC, where any uncorrected error must be treated as a priority, with no middle ground. On consumer GPUs, this does not apply.<br \/>\nWith these thresholds, we will detect memory pressure, thermal risk and critical errors without falling into <a href=\"https:\/\/pandorafms.com\/blog\/alert-fatigue-monitoring\/\" target=\"_blank\" rel=\"noopener\">alert overload<\/A>, that noise that eventually makes everyone ignore everything.<br \/>\nAnd if an organization needs different values, great. We believe every organization is its own world, so the thresholds can be adjusted to its specific reality from the console itself.<\/p>\n<h2 id=\"10\">Compatibility and requirements of the Pandora FMS plugin<\/h2>\n<p>\u201cPromises that mean nothing&#8230;\u201d, as the song says. 
At Pandora FMS, we are not fond of them, so to avoid making claims that later do not hold up, <strong>it is worth being clear about what the plugin does and does not do<\/Strong>:<\/p>\n<ul class=\"lista\">\n<li>It is aimed at <strong>NVIDIA GPUs<\/strong> and uses <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/Span> as its data source.<\/li>\n<li>It works as a <strong>local agent plugin<\/strong> and requires the NVIDIA driver to be installed and <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/Span> to be available.<\/li>\n<li>It is <strong>compatible with Linux and Windows<\/strong>.<\/li>\n<li>In the cloud, it works with GPU instances from <a href=\"https:\/\/pandorafms.com\/en\/monitoring-amazon-web-services\/\" target=\"_blank\" rel=\"noopener\">AWS<\/a>, <a href=\"https:\/\/pandorafms.com\/en\/monitoring-microsoft-azure\/\" target=\"_blank\" rel=\"noopener\">Azure<\/a> or <a href=\"https:\/\/pandorafms.com\/en\/google-cloud-monitoring\/\" target=\"_blank\" rel=\"noopener\">Google Cloud<\/a> <strong>as long as the NVIDIA GPU is exposed to the operating system<\/strong> and the driver is properly installed. Providers document this point, as shown in the guides from <a href=\"https:\/\/cloud.google.com\/compute\/docs\/gpus\/monitor-gpus\" target=\"_blank\" rel=\"noopener nofollow\">Google Cloud<\/A> or <a href=\"https:\/\/docs.aws.amazon.com\/dlami\/latest\/devguide\/tutorial-gpu-monitoring-gpumon.html\" target=\"_blank\" rel=\"noopener nofollow\">AWS<\/a>.<\/li>\n<li>At the moment, it <strong>does not support AMD or Intel<\/strong>.<\/li>\n<li>Nor does it monitor AI models, prompts, drift or MLOps metrics.<\/li>\n<li><strong>The recommended agent interval is at least 30 seconds<\/strong>, although the default value of 300 seconds is suitable for most environments.<\/li>\n<\/ul>\n<p>And one operational nuance. 
<strong>The plugin is designed for individual hosts or medium-sized environments<\/strong>.<br \/>\nIf your case involves clusters with many GPUs per node or large-scale Kubernetes architectures, it may need to be complemented with approaches such as DCGM, Prometheus or other aggregation solutions.<br \/>\nNo tool does everything, and acknowledging that seems honest to us.<\/p>\n<h2 id=\"11\">GPU monitoring and capacity planning<\/h2>\n<p>Considering the number of firstborns we must sacrifice to the Machine God to afford a GPU, <strong>everything above is more than justified<\/Strong>. Either we monitor, or the most critical point becomes a blind spot.<br \/>\nBut <strong>it is in capacity planning where monitoring translates into hard cash<\/strong>.<br \/>\nWith historical data on utilization, memory, temperature and power consumption, it is possible to detect GPUs that are recurrently saturated, identify underused ones, redistribute workloads and justify purchases to the procurement department with data, instead of hunches and pleas, since we will be able to measure the real growth in demand.<br \/>\nIn the end, it is <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/data-science-it\/\" target=\"_blank\" rel=\"noopener\">data science applied to IT<\/a> and turning the past into decisions.<\/p>\n<h2 id=\"12\">GPU monitoring as part of AI infrastructure monitoring<\/h2>\n<p>People usually talk about GPUs because, these days, they have climbed to the top of many IT infrastructures, <strong>but they do not play the game alone<\/Strong>.<br \/>\nAn AI infrastructure also depends on CPU, RAM, disk, storage, network, APIs, services, logs, processes, containers, databases and overall availability.<br \/>\nThat is why <strong>GPU monitoring must be part of a broader AI infrastructure monitoring strategy<\/strong>.<br \/>\nWatching only the GPU is like piloting the Enterprise while paying attention only to what is happening in the antimatter engine, while ignoring 
shields, life support or even the course.<br \/>\nThat is where Pandora FMS becomes that legendary ship computer that knew everything.<br \/>\nPandora integrates that visibility into a complete IT operation, where <a href=\"https:\/\/pandorafms.com\/en\/ai-it-management-smart-monitoring\/\" target=\"_blank\" rel=\"noopener\">AI applied to management<\/a>, <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/generative-ai-it-management\/\" target=\"_blank\" rel=\"noopener\">generative AI<\/a> and <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/deep-learning-monitoring-itsm\/\" target=\"_blank\" rel=\"noopener\">deep learning<\/a> provide the analysis that, let us admit it as a species, no human could sustain manually, nor would they want to, because there is no need.<br \/>\nThis is especially relevant in <a href=\"https:\/\/pandorafms.com\/blog\/intelligent-data-centers-2\/\" target=\"_blank\" rel=\"noopener\">data centers<\/a>, where availability is taken for granted until it is no longer there, as any good <a href=\"https:\/\/pandorafms.com\/en\/it-topics\/uptime-monitoring\/\" target=\"_blank\" rel=\"noopener\">uptime monitoring<\/a> strategy knows well.<\/p>\n<h2 id=\"13\">Conclusion<\/h2>\n<p>We have covered a lot, but let\u2019s focus on three ideas to take home.<\/p>\n<ul class=\"lista\">\n<li>That <strong>GPUs are critical resources<\/Strong> in AI and HPC infrastructures, not a technical detail.<\/li>\n<li>That <strong>monitoring them in isolation is not enough<\/Strong>, since a figure without context is noise.<\/li>\n<li><strong>That the value lies in integrating them into the global monitoring of the infrastructure<\/Strong>, where a GPU metric talks to servers, services, networks, logs and alerts.<\/li>\n<\/ul>\n<p>NVIDIA tools provide the reading, <Strong>but a command bridge is needed to interpret it<\/Strong>.<br \/>\nPandora FMS aims to be that bridge to bring GPU monitoring to on-premise, hybrid and enterprise environments. 
Because in the gaps of IT live the gremlins of the machine, <Strong>the ones that cause trouble precisely where you are not watching<\/Strong>.<\/p>\n<h2 id=\"14\">Frequently asked questions<\/h2>\n<p>Let\u2019s compile the most important questions about GPU monitoring and their answers.<\/p>\n<h4>What is GPU monitoring?<\/h4>\n<p>GPU monitoring is <strong>the continuous supervision of the status, utilization, memory, temperature, power consumption and errors of Graphics Processing Units in professional environments<\/Strong>.<br \/>\nIt is used to keep watch over GPUs running AI, HPC, inference, model training or intensive processing workloads.<br \/>\nAs mentioned at the beginning, it should not be confused with gaming, overclocking or graphics tuning tools.<\/p>\n<h4>Why is it important to monitor GPUs in AI infrastructures?<\/h4>\n<p>Because GPUs are what keep the heart beating and, moreover, they are <Strong>very expensive and critical resources<\/Strong> for inference, training and machine learning workloads.<br \/>\n<Strong>Without monitoring, we are blind to our most critical asset<\/Strong>, leaving it defenseless against saturation, overheating, errors or underutilization, which have a habit of attacking from the shadows and gaps we leave unwatched.<br \/>\nThis increases operational risk and makes it harder to get new hardware investments approved by the finance department.<\/p>\n<h4>Which metrics should be monitored on a GPU?<\/h4>\n<p>The main ones are utilization, memory (used, free and total), percentage of memory used, temperature, instantaneous power consumption, power limit, fan speed, status, critical ECC errors, GPU model, driver version and CUDA version.<br \/>\nQuite a list, yet in isolation they give us an unfinished puzzle, so <Strong>in professional environments, these metrics must be analyzed together with the rest<\/Strong> of the infrastructure.<\/p>\n<h4>What is the difference between nvidia-smi, DCGM and a monitoring platform?<\/h4>\n<p><span style=\"color:green; 
font-family:Pandocode;\">nvidia-smi<\/span> is a command-line utility for querying specific metrics from NVIDIA GPUs.<br \/>\nDCGM is aimed at GPU management and monitoring in data centers and clusters.<br \/>\nA platform such as Pandora FMS <Strong>centralizes those metrics together with the rest of the infrastructure<\/Strong>, with history, alerts, dashboards and reports.<\/p>\n<h4>At what temperature is a server GPU considered critical?<\/h4>\n<p>Based on our experience embedded in the Pandora FMS GPU monitoring plugin, <Strong>a temperature from 70 \u00b0C is considered warning and from 90 \u00b0C, critical<\/Strong>.<br \/>\nThese thresholds serve as an operational reference, although they may vary depending on the GPU model, manufacturer and thermal conditions of the environment.<\/p>\n<h4>What are ECC errors in a GPU and why do they matter?<\/h4>\n<p>ECC (Error-Correcting Code) errors are <Strong>memory errors detected in GPUs with ECC support<\/Strong>.<br \/>\nUncorrected errors may indicate hardware failures and must be treated as critical incidents.<br \/>\nThey are especially relevant in data center GPUs such as A100, H100, Tesla or RTX PRO. In consumer-grade GPUs, the module is still issued, but with a value of 0 in the Pandora FMS plugin, because there are no ECC errors to count.<\/p>\n<h4>Can GPUs be monitored in on-premise environments with Pandora FMS?<\/h4>\n<p>Yes. 
<strong>Pandora FMS has a local agent plugin based on <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/span> that allows NVIDIA GPUs to be monitored on on-premise servers<\/strong>.<br \/>\nIt can also be used in hybrid environments or cloud instances if the NVIDIA GPU is exposed to the operating system and the driver is correctly installed.<\/p>\n<h4>How does GPU monitoring help with capacity planning?<\/h4>\n<p>Historical utilization, memory, temperature and power consumption data makes it possible to <strong>identify GPUs that are recurrently saturated, detect underutilization and redistribute<\/strong> workloads.<br \/>\nIt also <strong>helps justify hardware purchases<\/strong> or postpone unnecessary expansions with objective usage and capacity data in hand.<\/p>\n<h4>What happens if nvidia-smi is not installed on the server?<\/h4>\n<p>If <span style=\"color:green; font-family:Pandocode;\">nvidia-smi<\/span> is not available, the Pandora FMS plugin issues the <span style=\"color:green; font-family:Pandocode;\">GPU_Count<\/span> module with a value of 0 in critical status.<br \/>\nThis <strong>allows an alert to be generated without breaking the agent execution or producing invalid XML<\/strong>.<br \/>\nThe issue therefore stays visible, either as an absent GPU or as a failure in the availability of the data source.<\/p>\n<h4>Is GPU monitoring enough to manage an AI infrastructure?<\/h4>\n<p><strong>No<\/strong>. It is part of a broader whole.<br \/>\nThe GPU is a critical component, but <strong>an AI infrastructure also depends on CPU, RAM, storage<\/strong>, network, services and more.<br \/>\nThat is why GPU monitoring must be integrated into a broader AI infrastructure monitoring strategy. 
Without that, it will be impossible to have a complete operational view.<br \/>\n[\/et_pb_text][et_pb_button button_url=&#8221;@ET-DC@eyJkeW5hbWljIjp0cnVlLCJjb250ZW50IjoicG9zdF9saW5rX3VybF9wYWdlIiwic2V0dGluZ3MiOnsicG9zdF9pZCI6IjM2MjI3MCJ9fQ==@&#8221; button_text=&#8221;\u2190 Back to IT Topics&#8221; button_alignment=&#8221;left&#8221; _builder_version=&#8221;4.22.0&#8243; _dynamic_attributes=&#8221;button_url&#8221; _module_preset=&#8221;default&#8221; custom_button=&#8221;on&#8221; button_text_size=&#8221;1em&#8221; button_text_color=&#8221;#0C312F&#8221; button_bg_color=&#8221;#FFFFFF&#8221; button_bg_color_gradient_direction=&#8221;90deg&#8221; button_bg_color_gradient_stops=&#8221;#82B92E 0%|#3CB92E 100%&#8221; button_bg_color_gradient_start=&#8221;#82B92E&#8221; button_bg_color_gradient_end=&#8221;#3CB92E&#8221; button_border_width=&#8221;1px&#8221; button_border_color=&#8221;#eaeaea&#8221; button_border_radius=&#8221;100px&#8221; button_use_icon=&#8221;off&#8221; z_index=&#8221;0&#8243; custom_margin=&#8221;60px||0px||false|false&#8221; custom_padding=&#8221;10px|50px|10px|50px|true|true&#8221; custom_padding_tablet=&#8221;&#8221; custom_padding_phone=&#8221;10px|20px|10px|20px|true|true&#8221; custom_padding_last_edited=&#8221;on|phone&#8221; custom_css_main_element=&#8221;right:0!important;||font-family:%22Pandora-Bold%22!important;&#8221; global_module=&#8221;367749&#8243; locked=&#8221;off&#8221; global_colors_info=&#8221;{}&#8221; button_bg_color__hover_enabled=&#8221;on|desktop&#8221; button_bg_color_gradient_start__hover=&#8221;#eaeaea&#8221; button_bg_color_gradient_end__hover=&#8221;#f4f4f4&#8243; button_bg_color__hover=&#8221;#eaeaea&#8221; button_bg_enable_color__hover=&#8221;on&#8221; button_bg_use_color_gradient__hover=&#8221;on&#8221; button_bg_color_gradient_stops__hover=&#8221;#eaeaea 0%|#f4f4f4 100%&#8221;][\/et_pb_button][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; 
custom_padding_last_edited=&#8221;on|desktop&#8221; admin_label=&#8221;Final CTA&#8221; _builder_version=&#8221;4.27.0&#8243; _module_preset=&#8221;default&#8221; background_color=&#8221;#161327&#8243; use_background_color_gradient=&#8221;on&#8221; background_color_gradient_stops=&#8221;rgba(22,19,39,0.5) 17%|rgba(22,19,39,0.5) 100%&#8221; background_color_gradient_overlays_image=&#8221;on&#8221; background_image=&#8221;https:\/\/pandorafms.com\/wp-content\/uploads\/2023\/12\/banner-contacta-it-topics.webp&#8221; background_size=&#8221;custom&#8221; background_image_width=&#8221;121%&#8221; background_image_height=&#8221;192%&#8221; background_position=&#8221;top_left&#8221; z_index=&#8221;1&#8243; max_width=&#8221;1080px&#8221; max_width_tablet=&#8221;98%&#8221; max_width_phone=&#8221;98%&#8221; max_width_last_edited=&#8221;on|tablet&#8221; module_alignment=&#8221;center&#8221; custom_margin=&#8221;80px||80px||true|false&#8221; custom_padding=&#8221;40px|20px|120px|20px|false|true&#8221; custom_padding_tablet=&#8221;40px|0px|120px|0px|false|true&#8221; custom_padding_phone=&#8221;40px|0px|120px|0px|false|true&#8221; scroll_scaling=&#8221;40|55|85|100|100%|120%|100%&#8221; motion_trigger_start=&#8221;top&#8221; background_last_edited=&#8221;on|phone&#8221; background_size_tablet=&#8221;cover&#8221; background_size_phone=&#8221;cover&#8221; background_position_phone=&#8221;top_center&#8221; border_radii=&#8221;off|20px|20px|20px|20px&#8221; border_color_all=&#8221;#ffffff&#8221; box_shadow_style=&#8221;preset1&#8243; box_shadow_vertical=&#8221;0px&#8221; box_shadow_blur=&#8221;80px&#8221; box_shadow_color=&#8221;#506da0&#8243; saved_tabs=&#8221;all&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_row use_custom_gutter=&#8221;on&#8221; gutter_width=&#8221;2&#8243; make_equal=&#8221;on&#8221; _builder_version=&#8221;4.22.0&#8243; _module_preset=&#8221;default&#8221; max_width=&#8221;750px&#8221; module_alignment=&#8221;center&#8221; 
custom_margin=&#8221;0px||0px||true|false&#8221; custom_padding=&#8221;0px|0px|0px|0px|true|true&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.22.0&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_text content_tablet=&#8221;<\/p>\n<p class=%22h2-w%22>Talk to the sales team, request a quote, or get answers to your questions about our licenses<\/p>\n<p>&#8221; content_phone=&#8221;<\/p>\n<p class=%22h2-w%22>Talk to the sales team, get answers to your questions about our licenses, or request a quote<\/p>\n<p>&#8221; content_last_edited=&#8221;on|tablet&#8221; _builder_version=&#8221;4.22.0&#8243; _module_preset=&#8221;default&#8221; header_2_font_size=&#8221;2em&#8221; text_orientation=&#8221;center&#8221; module_alignment=&#8221;left&#8221; custom_margin=&#8221;0px||20px||false|false&#8221; custom_padding=&#8221;0px||0px||true|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p class=\"h2-w\">Talk to the sales team, request a quote, <br \/>or get answers to your questions about our licenses<\/p>\n<p>[\/et_pb_text][et_pb_code _builder_version=&#8221;4.27.4&#8243; _module_preset=&#8221;default&#8221; width=&#8221;200px&#8221; module_alignment=&#8221;center&#8221; custom_margin=&#8221;40px||0px||false|false&#8221; custom_padding=&#8221;0px||0px||true|false&#8221; locked=&#8221;off&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<div class=\"doblebtn\"><!-- [et_pb_line_break_holder] --><a class=\"prices-2024-btn\" style=\"padding:10px 30px!important\" href=\"https:\/\/pandorafms.com\/en\/contact\/\">Contact us now!<\/a><\/div>\n<p>[\/et_pb_code][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sections What GPU monitoring is Why GPU monitoring matters On-premise, hybrid and enterprise environments Key metrics in GPU monitoring nvidia-smi, DCGM and monitoring platforms The risks of not monitoring GPUs From 
isolated metrics to operational correlation How Pandora FMS helps with GPU monitoring Thresholds, alerts and critical statuses How Pandora FMS helps monitor databases Compatibility [&hellip;]<\/p>\n","protected":false},"author":33,"featured_media":424581,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":"","_joinchat":[],"footnotes":""},"categories":[3505,7756],"tags":[],"class_list":["post-424589","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-it-topics","category-monitoring"],"_links":{"self":[{"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/posts\/424589","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/users\/33"}],"replies":[{"embeddable":true,"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/comments?post=424589"}],"version-history":[{"count":4,"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/posts\/424589\/revisions"}],"predecessor-version":[{"id":424782,"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/posts\/424589\/revisions\/424782"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/media\/424581"}],"wp:attachment":[{"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/media?parent=424589"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/categories?post=424589"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pandorafms.com\/en\/wp-json\/wp\/v2\/tags?post=424589"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}