Updating a critical monitoring platform

Contents:

The problem: Becoming blind when we most need to see
What makes a major monitoring upgrade different
What changes in Pandora FMS 800 LTS and why it requires proper migration planning
What determines risk: initial architecture, customizations, and dependencies
What good preparation before upgrading should include
What changes in our operations after upgrading
When to rely on official documentation, support, and available resources

Equality is the right approach when it comes to people, but when it comes to applications… we cannot treat them all the same way, especially when upgrading them. Upgrading a monitoring platform the way you would a browser plugin (the one you have been postponing for three weeks) is a recipe for disaster, especially when the ingredients of that recipe are clicking the button, crossing your fingers, and leaving it up to the Machine God.
This is because monitoring is the nervous system of the environment, the layer that tells us something is wrong before the disaster becomes irreversible.
If during the upgrade process that system becomes degraded, partially blind, or directly unavailable, we have created the worst possible situation:
A moment of maximum operational risk with minimal response capacity.
That is why upgrading a critical infrastructure monitoring platform requires more than just clicking “Next” at each step of the installer. It is necessary to prepare as if it were a service continuity operation rather than a routine maintenance task.

The problem: Becoming blind when we most need to see

In technology, moments of change are also moments of vulnerability. It may sound dramatic, but it is true. When an application server is upgraded and something goes wrong, the impact is localized: that service goes down, it is rolled back and investigated, but the rest of the environment remains visible, monitored, and under control.
However, when the monitoring platform itself fails, or becomes partially operational during a major upgrade, the scenario changes completely.
The team loses visibility over the rest of the infrastructure at the exact moment it needs it the most: the moment when changes, restarts, and validations are being performed, leading to situations such as:

A possible failure in a remote node.
An alert that does not arrive.
A threshold exceeded without anyone noticing…

What makes a major monitoring upgrade different

There is an important difference between an incremental update and a migration involving architectural changes. A security patch or minor update does not change how components interact with each other. You apply it, restart, and (with that bit of luck needed for everything in life) done. The system keeps working as before.
A major upgrade, however, may involve:

Changes in the internal architecture.
New components replacing previous ones.
Modifications in compatibility with the operating system or dependencies.
Adjustments in how processing tasks are distributed…

This may affect, for example, how distributed environments are configured, long-standing customizations and tweaks, or whether active integrations continue to work the same way or require review.
IT change management establishes that not all changes are equal in terms of risk and scope.
And a monitoring platform migration is, by definition, a high-impact change. Therefore, the criteria cannot be the same as for a standard low-risk modification.

What changes in Pandora FMS 800 LTS and why it requires proper migration planning

Pandora FMS 800 LTS Aquarius introduces significant changes that require proper migration planning, especially for those upgrading from 777 LTS Andromeda.
The main ones are:

1. The new server architecture

The most important change affects the internal server architecture, where functions have been redistributed among components:

The Network Server now consolidates tasks that previously required dedicated processes (WMI, remote scripts, web checks or prediction, for example).
The Network High Performance Server specializes in ICMP and SNMP polling at very short intervals.
The Heavy Server, as its name suggests, takes on the most resource-intensive workload: plugins, inventory, vulnerability management, and data export.

This simplifies the overall architecture and optimizes Pandora FMS operations, but in return it implies that environments that deployed specific servers for minor functions need to review how their configuration looks after the migration.
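As a rough aid for that review, the redistribution described above can be written down as a lookup table. This is only a minimal sketch in Python: the function labels and the helper name plan_server_review are illustrative assumptions, not Pandora FMS configuration keys, and the official upgrade guide remains the authority on where each function ends up.

```python
from collections import defaultdict

# Function-to-server mapping as described in this article for 800 LTS.
# The string labels are illustrative, not real configuration tokens.
SERVER_FOR_FUNCTION = {
    "wmi": "Network Server",
    "remote_scripts": "Network Server",
    "web_checks": "Network Server",
    "prediction": "Network Server",
    "icmp_fast_polling": "Network High Performance Server",
    "snmp_fast_polling": "Network High Performance Server",
    "plugins": "Heavy Server",
    "inventory": "Heavy Server",
    "vulnerability_management": "Heavy Server",
    "data_export": "Heavy Server",
}


def plan_server_review(functions_in_use):
    """Group the functions an environment uses by the 800 LTS server handling them.

    Functions missing from the table are returned under "needs manual review".
    """
    review = defaultdict(list)
    for function in functions_in_use:
        server = SERVER_FOR_FUNCTION.get(function, "needs manual review")
        review[server].append(function)
    return dict(review)


if __name__ == "__main__":
    # Hypothetical inventory of what a 777 LTS environment relies on today.
    current = ["wmi", "plugins", "inventory", "custom_syslog_parser"]
    for server, functions in plan_server_review(current).items():
        print(f"{server}: {', '.join(functions)}")
```

Anything that falls under "needs manual review" is a good candidate for the list of customizations to validate before the migration.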

2. Pandora_supervisor

Another new component is pandora_supervisor, which is responsible for supervising and restarting the platform, making updates more transparent and reducing manual intervention.
Less manual intervention means less room for human error during the process.

3. Improved compatibility

Regarding compatibility, Pandora FMS 800 LTS expands support to RHEL 9 and PHP 8.4, while also improving SNMPv3 support.
For environments already running these versions, this is good news. For those that are not, it is necessary to verify that current dependencies will not become an issue.
For guidance, refer to Pandora FMS documentation, which recommends reviewing the upgrade guide from 777 LTS to 800 and paying special attention to environments with customized configurations or distributed deployments.
This is not the typical bureaucratic notice included for formality; it is a real technical warning. If you are upgrading from Pandora Andromeda 777 to Pandora Aquarius 800, review it to identify possible extra steps and details.
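One way to make that dependency verification concrete is to compare what is installed against the versions you have confirmed as supported. The sketch below is an illustration under stated assumptions: the CONFIRMED_SERIES table only holds the two versions this article mentions (RHEL 9 and PHP 8.4), the installed values in the usage note are made up, and the real compatibility matrix must come from the official upgrade guide. A dependency flagged here is not necessarily unsupported, only unconfirmed by the table.

```python
def parse_version(text):
    """Turn '8.4.1' into (8, 4, 1). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


# Series this article mentions as supported by 800 LTS. Other series may be
# supported too: extend this table from the official compatibility matrix.
CONFIRMED_SERIES = {"rhel": ["9"], "php": ["8.4"]}


def matches_series(version, series):
    """True if version '8.4.1' belongs to series '8.4' (numeric prefix match)."""
    prefix = parse_version(series)
    return parse_version(version)[: len(prefix)] == prefix


def needs_verification(installed, confirmed=CONFIRMED_SERIES):
    """Return the names of dependencies not covered by a confirmed series."""
    pending = []
    for name, version in installed.items():
        series_list = confirmed.get(name, [])
        if not any(matches_series(version, series) for series in series_list):
            pending.append(name)
    return sorted(pending)
```

For example, needs_verification({"rhel": "8.10", "php": "8.4.1", "mysql": "8.0.36"}) returns ['mysql', 'rhel']: PHP matches a confirmed series, while the other two still have to be checked against the documentation before the upgrade starts.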

4. Other improvements and optimizations

One example is automatic load balancing in remote checks. It complements the HA (High Availability) system in active/passive mode that has been in place for years.
This HA can also be enabled or disabled for agents, increasing configuration flexibility in distributed environments.

What determines risk: initial architecture, customizations, and dependencies

The risk of upgrading a monitoring platform is not uniform. Think of it as a journey: the risk depends on how far the current environment is from what the new version requires.
The main factors influencing that risk are:

The starting version, obviously. Migrating from 777 LTS to 800 LTS means jumping over an entire cycle of changes. The older the source version, the more elements must be reviewed to avoid failure mid-transition.
The size and criticality of the environment. Size matters. Updating an infrastructure with two hundred monitored devices is not the same as one with fifty thousand, especially if strict SLAs do not allow operational blind spots.
Whether the deployment is centralized or distributed. Distributed environments, with remote nodes and satellite servers, require sequential planning. The central server cannot be upgraded without considering dependent nodes.
Customizations implemented. Custom scripts, in-house modules, integrations with external tools… This is a classic challenge, as these elements may be affected by internal architectural changes and must be explicitly validated.
System dependencies. Such as OS version, PHP, MySQL, SNMPv3 libraries… If any dependency is not compatible with the new version, it must be resolved before starting the upgrade, not during it.
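The factors above can be turned into a simple review list for a given environment. The following is a minimal sketch, not an official assessment tool: the EnvironmentProfile fields and the wording of each item are assumptions made for illustration, and environment size is left out on purpose because any numeric threshold would be arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentProfile:
    """Hypothetical description of an environment; field names are illustrative."""
    source_version: str = "777"
    strict_slas: bool = False
    distributed: bool = False
    customizations: list = field(default_factory=list)
    unverified_dependencies: list = field(default_factory=list)


def review_items(profile):
    """List what must be reviewed before upgrading, one entry per risk factor present."""
    items = []
    if profile.source_version != "777":
        items.append(
            f"Source version {profile.source_version} is not 777 LTS: expect a broader review"
        )
    if profile.strict_slas:
        items.append("Strict SLAs: no monitoring blind spots allowed during the window")
    if profile.distributed:
        items.append("Distributed deployment: plan the upgrade sequence for dependent nodes")
    for name in profile.customizations:
        items.append(f"Customization '{name}': validate explicitly against the new architecture")
    for name in profile.unverified_dependencies:
        items.append(f"Dependency '{name}': resolve compatibility before starting")
    return items


if __name__ == "__main__":
    # Made-up environment used only to show the output format.
    profile = EnvironmentProfile(distributed=True, customizations=["ticketing_hook"])
    for item in review_items(profile):
        print("-", item)
```

An empty list does not mean zero risk, only that none of the factors named in this article were declared for that environment.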

The best battle is the one you do not fight and, as The Art of War has it, battles are won in preparation. IT risk mitigation therefore starts with identifying these factors in advance.
Otherwise, we would be relying on luck, and luck is not something we can count on. So let’s go deeper into this.

What good preparation before upgrading should include

For those who thought I wouldn’t reference Star Trek this time… lost bet. Because that series, like The Simpsons, contains and predicts everything.
In the episode Relics of The Next Generation, the legendary Scotty from the original series asks chief engineer La Forge how long he told the captain a task would take. La Forge replies one hour, and Scotty asks how long it will actually take. La Forge, somewhat annoyed and confused, answers that it will take that same hour, which horrifies Scotty:
“How are you going to build a reputation as a miracle worker if you tell the captain the real time it will take?”, replies the iconic character.
The strategy, according to Scotty, is to inflate that time, a key principle we must apply here, because critical systems always present unexpected issues and those who do not anticipate them will suffer them.
In IT operations, the “Scotty Principle” is essential, and we must plan with time safety margins. Not only to look like geniuses (which might happen), but to have room to maneuver when inevitable gremlins appear, as they naturally do.
And if they don’t appear (they will, remember we are not the luckiest people in the world), great: we’ll look like the best engineers in Starfleet.
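The arithmetic behind the Scotty Principle is trivial, but writing it down makes it harder to skip. Below is a minimal sketch in Python; the margin factor of 2.0 and the function names are arbitrary assumptions chosen for illustration, not a recommended value.

```python
from datetime import datetime, timedelta


def padded_estimate(estimate, margin_factor=2.0):
    """Inflate a time estimate by a safety factor (the default is an arbitrary example)."""
    return timedelta(seconds=estimate.total_seconds() * margin_factor)


def fits_window(window_start, window_end, estimate, margin_factor=2.0):
    """True if the padded estimate still finishes inside the maintenance window."""
    return window_start + padded_estimate(estimate, margin_factor) <= window_end


if __name__ == "__main__":
    # Hypothetical four-hour window, from 22:00 to 02:00 the next day.
    start = datetime(2025, 1, 10, 22, 0)
    end = datetime(2025, 1, 11, 2, 0)
    for minutes in (90, 150):
        estimate = timedelta(minutes=minutes)
        verdict = "fits" if fits_window(start, end, estimate) else "does not fit"
        print(f"{minutes} min estimate, padded to {padded_estimate(estimate)}: {verdict}")
```

With these made-up numbers, a 90-minute estimate becomes three hours and fits the window, while a 150-minute estimate becomes five hours and does not, which is a signal to negotiate a longer window before starting rather than during the upgrade.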
Based on this principle, proper preparation for a major monitoring platform upgrade should include, at minimum:

Preliminary environment review. This involves an inventory of deployed servers, dependency versions, documented customizations, active integrations… If this documentation does not exist or is outdated, creating it is the first step, and it is not optional.
Full backup. Of the database, configuration files, scripts, and custom modules… The backup must be verified as functional and recoverable. Ideally, following the classic 3-2-1 principle.
Testing in a non-production environment, if possible. It may not always be feasible to replicate a critical production environment, but avoid taking unnecessary risks. Even partial testing in a lab environment can reveal incompatibilities before they reach production.
A clear rollback plan. No one likes to think things will go wrong, but as Andy Grove from Intel said: “Only the paranoid survive.” Before starting, define what actions will be taken if the upgrade fails: steps, timing (including the “Scotty margin”), and decision-makers. Crisis moments are not the time for improvisation.
An optimal maintenance window. Select a time with minimal operational impact, communicate it to stakeholders, and adhere to it.
Clear responsibilities. Define who executes, supervises, validates, and decides on rollback. Ambiguity in critical moments multiplies errors.
Post-upgrade validation. Ideally, use a checklist to confirm the platform is fully operational: critical modules active, alerts working, distributed environments synchronized, integrations functional, etc. Because the upgrade does not end when the installer says so.
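The post-upgrade validation in the last item of the list above lends itself to a small script. This is a sketch under stated assumptions: run_checklist is a hypothetical helper, and the lambdas are placeholders standing in for real probes of your platform, which this article does not specify.

```python
def run_checklist(checks):
    """Run every named check and return the names of those that failed.

    A check fails if it returns a falsy value or raises an exception.
    """
    failed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Placeholder checks: replace each lambda with a real probe of your platform.
    checks = [
        ("critical modules active", lambda: True),
        ("alerts working", lambda: True),
        ("distributed environments synchronized", lambda: False),
        ("integrations functional", lambda: True),
    ]
    failed = run_checklist(checks)
    if failed:
        print(f"Validation failed, escalate to the rollback decision-maker: {failed}")
    else:
        print("Upgrade validated")
```

Treating an exception as a failed check is a deliberate choice here: a probe that cannot run tells you nothing about the platform, so it should not count as a pass.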

What changes in our operations after upgrading

Unfortunately, organizational changes often confirm that “life goes on as usual.” This happens with meeting decisions, yearly resolutions, and sometimes after upgrades like this.
However, once the migration to the new Pandora FMS version is successfully completed, and if proper planning and dependency validation have been done, the environment (and the team) should not operate the same way as before, but better, because:

The new architecture of Pandora FMS 800 LTS reduces the number of dedicated processes, resulting in fewer failure points and lower maintenance workload for the team.
Pandora_supervisor makes future upgrades more transparent and less disruptive.
More frequent network polling improves the granularity of visibility.
Extended compatibility with RHEL 9 and PHP 8.4 removes technical debt related to dependencies.

When to rely on official documentation, support, and available resources

There are environments where the upgrade can be handled autonomously with sufficient preparation. However, there are others where that autonomy represents an unacceptable risk.
The cases where it makes the most sense to rely on external resources are:

Migrations from versions prior to 777 LTS, where accumulated changes are broader and the chances of incompatibility increase.
Complex distributed environments, with multiple nodes and/or custom configurations requiring a specific upgrade sequence.
Installations with critical integrations with third-party systems that do not tolerate interruptions or degraded periods.
Environments without up-to-date documentation of the actual platform state. Without knowing exactly what is deployed and how, any major migration becomes a risky endeavor.

The upgrade guide from 777 to 800 referenced above is the mandatory starting point in all cases, as it is designed to anticipate the most common incompatibility scenarios and guide the process in a structured way.
And of course, Pandora FMS official support exists precisely for these scenarios.
With it, a complex upgrade in a critical environment does not rely solely on the judgment of the internal team under pressure.
With our support—and as the famous football anthem says—you will never walk alone. Relying on that support is not a sign of technical weakness; on the contrary, it is a smart operational decision, ensuring that critical upgrades are carried out with proper backup.
Ultimately, upgrading versions is easy. Doing it without losing visibility, without compromising operations, and without accumulating technical debt is what separates a well-executed upgrade from one that simply didn’t fail—yet.
For all these reasons, a critical monitoring platform migration must be prepared as a service continuity operation with prior analysis, verified backup, rollback plan, planned maintenance window, and post-upgrade validation. It is not a maintenance task to be completed casually… unless the goal is to spend the following days fixing avoidable issues caused by lack of planning.

Isaac García

Always with a keyboard in my hands, ever since I opened up my first ZX Spectrum to see how it worked, technology has been my passion and my work, what I talk about and what I write about.
