
Reliance on Key Technicians in an MSP: How to Reduce It

In Star Trek, whenever Captain Kirk needed something impossible, he turned to Scotty. Only Scotty knew how to get the most out of the dilithium crystals, only he understood the Enterprise’s inner workings, and only he could pull off the last-minute miracle with everything against him. That makes for brilliant television, but it is a disaster as an operating model: if Scotty caught the Klingon flu, the entire ship would be left adrift.

That same thing happens in too many MSPs. We have our own Scotty — that brilliant technician capable of carrying half the client portfolio on his shoulders — and we may even celebrate it as a strength because we have an inimitable genius on the payroll. But as we will see, it is actually a single human point of failure that prevents us from scaling safely.
Technical specialization is valuable, that is undeniable, but it stops being an asset when it crosses the boundary into excessive dependence on specific individuals. And what turns specialization into dependence is an operational design flaw, not a talent or human resources problem.

What it really means to depend on indispensable technicians

We suffer from operational dependence when the day-to-day functioning of the MSP revolves around specific individuals, rather than shared processes.
That is the fundamental premise, because if critical knowledge lives in the heads of those geniuses — and not in the knowledge base or distributed among other technicians — our daily life tends to be made up of:

  • Clients that only one person “understands.” Because no one else has touched that infrastructure, or knows where to begin, since it has never been documented in the internal wiki.
  • Critical maintenance tasks that are also undocumented anywhere, except in the memory of whoever has been performing them for the past three years.
  • Incidents that are systematically escalated to the same technician. Not because that is the procedure, but because our particular Scotty is the only one who knows how to resolve them.
  • Scattered operational knowledge. Living in WhatsApp chats, mental notes, post-its, and that notebook Laura always carries around — whose arcane contents appear in no shared knowledge base.

If any of this defines us, we are not special no matter what we were told at school. This is a structural pattern that repeats insistently across many MSPs, and one that we must address before it blows up in our faces.

Signs that an MSP already has this problem

Humans have a masterful skill — adaptation — which works against us here, because what makes dependence dangerous is that it becomes normalized.
Like the frog in slowly heating water, we grow accustomed and do nothing until the water is boiling around us.
However, if we take the trouble to look honestly and stop burying our heads in the sand, there are clear symptoms telling us to jump to a different way of doing things. The main ones are:

  • Tickets blocked for hours or days until a specific technician comes on shift or returns from vacation.
  • Onboarding of new profiles that drags on endlessly, because the knowledge needed to do it is not written down and shared. It remains trapped in informal conversations and the accumulated experience of veterans.
  • Vacations or sick leave that slow down operations and generate nervousness within the team.
  • Clients or environments that no one can take on with confidence, except the usual people, turning every absence into a game of Russian roulette with the SLA.
  • Documentation that exists but is so generic or outdated that it is useless for day-to-day operations.
  • Chronic dependence on informal history. “Ask Marcos, he’s the one who set that up.” If we hear phrases like this often around the office, we have a problem.

The impact of dependence on SLAs, scalability, and profitability

Dependence on indispensable technicians is more than a spanner in the works of daily operations: it is an operational drain that directly affects the business and its results.

The bottlenecks

If only the resident genius can resolve certain types of incidents, resolution time depends on their availability, when it should track the severity of the problem.
When this happens, the Mean Time to Repair (MTTR) skyrockets for reasons that have little to do with technical complexity and much more to do with knowledge being kept under lock and key inside a head that could be offered a better salary elsewhere tomorrow.
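One way to make this visible is to split MTTR into the time a ticket spends waiting for the right person and the time actually spent fixing it. The sketch below is only illustrative: the ticket records and the three timestamps per ticket (`opened`, `work_started`, `resolved`) are hypothetical and would have to come from whatever the ticketing system exports.

```python
from datetime import datetime
from statistics import mean

# Hypothetical ticket records: (opened, work_started, resolved).
# In practice these would come from a ticketing system export.
tickets = [
    (datetime(2024, 5, 6, 9, 0), datetime(2024, 5, 6, 15, 0), datetime(2024, 5, 6, 15, 30)),
    (datetime(2024, 5, 7, 10, 0), datetime(2024, 5, 7, 10, 15), datetime(2024, 5, 7, 11, 15)),
    (datetime(2024, 5, 8, 8, 0), datetime(2024, 5, 9, 8, 0), datetime(2024, 5, 9, 8, 45)),
]


def mean_hours(deltas):
    """Average a list of timedeltas and return the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600


def mttr_breakdown(records):
    """Split mean time to repair into waiting time and hands-on time."""
    waiting = [started - opened for opened, started, _ in records]
    hands_on = [resolved - started for _, started, resolved in records]
    total = [resolved - opened for opened, _, resolved in records]
    return {
        "mttr_h": mean_hours(total),
        "waiting_h": mean_hours(waiting),
        "hands_on_h": mean_hours(hands_on),
    }


breakdown = mttr_breakdown(tickets)
for name, hours in breakdown.items():
    print(f"{name}: {hours:.2f}")
```

With this sample data, nearly all of the MTTR is waiting time rather than repair work, which is the pattern to look for when resolution depends on one person's availability.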

Unnecessary escalations

Level-1 technicians who could perfectly well resolve an incident keep passing it on to the “expert,” because no one has given them the context they need to solve it.
Again, it is not a matter of lacking capability in those junior staff members, but rather an erroneous design of the MSP’s operations.
That bogs down senior staff with low-value tasks. Or as I have mentioned before, we use Ferraris to go to the supermarket — they consume too much and then aren’t available for what actually matters.

Inability to absorb workload spikes or absences

If two people hold 80% of critical knowledge and one falls ill while the other is on vacation, the operation wobbles like a house of cards.
And that is without factoring in staff turnover. Because when one of those geniuses decides to move on to greener pastures, they tear away a chunk of our operational capacity that leaves us badly wounded.

The impossibility of scaling robustly

Because every new client increases the load on the same shoulders as always, and strong backs are not common in the IT trade.
This is not how we grow — we bloat and become more fragile. We have a bigger castle, but it is made of glass. Thus, operational efficiency degrades with every new contract we toast with champagne at the start, only to manage later with unhealthy doses of stress.

The levers for reducing our operational dependence

We now know the enemy, as Sun Tzu preached in The Art of War — now let us find solutions, but with the usual caveat that there are no magic wands in real life.
What there are, however, are clear levers we must pull.
The general framework for optimal operations rests on four pillars:

  • Useful, living operational documentation. Which does not mean the classic fossilized wikis nobody reads. We are talking about documentation that reflects how things are actually operated today — updated and accessible.
  • Standardization of processes and environments. If every client is a special snowflake, it sounds like a compliment, but dependence on whoever knows each snowflake becomes inevitable. Standardization breaks that chain.
  • Automation of repetitive or sensitive tasks. What a machine can do reliably should not depend on a human’s memory — a memory that is also storing the mortgage, the kids, and the bad days.
  • Shared visibility of the technical context and the status of services. If the entire team sees the same thing in the NOC, knowledge stops being a secret and becomes accessible data that serves to make optimal decisions and operate like clockwork.

Now, the first task is to take a walk around our house and take an honest look at those four pillars. Which one wobbles the most? Which one has cracks? Which one is a makeshift scaffold held together with tape, rather than a pillar?

How to reduce dependence without losing technical quality

All the theory I have developed above is necessary for that “knowing the enemy,” but without a practical and progressive approach it remains something that only looks good in a PowerPoint.
So, once we have reviewed the temple’s pillars, let us get down to the trenches with an applicable step-by-step process.

1. Identify where knowledge is concentrated

This is as simple an exercise as asking: “If this person mysteriously disappears tomorrow, what breaks?”
The answers tend to be revealing — and occasionally terrifying.
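The same question can be put to the ticket history. As a rough sketch, assuming we can export who resolved each ticket for each client, the following counts the share of a client's tickets closed by its most frequent resolver and flags clients where that share is high. The data, the names, and the 70% threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical export from the ticketing system: (client, resolving technician).
resolved_tickets = [
    ("acme", "scotty"), ("acme", "scotty"), ("acme", "scotty"), ("acme", "uhura"),
    ("globex", "scotty"), ("globex", "sulu"), ("globex", "uhura"), ("globex", "sulu"),
    ("initech", "scotty"), ("initech", "scotty"),
]


def knowledge_concentration(tickets, threshold=0.7):
    """Return clients whose tickets are mostly resolved by a single technician.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    by_client = defaultdict(Counter)
    for client, technician in tickets:
        by_client[client][technician] += 1

    at_risk = {}
    for client, counts in by_client.items():
        technician, top_count = counts.most_common(1)[0]
        share = top_count / sum(counts.values())
        if share >= threshold:
            at_risk[client] = (technician, round(share, 2))
    return at_risk


print(knowledge_concentration(resolved_tickets))
```

Ticket counts are only a proxy: a client with few incidents can still be understood by a single person, so the output is a starting point for the conversation rather than a verdict.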

2. Map out critical and repetitive tasks

Not all technical dependencies carry the same weight. We must distinguish between what is critical to the service and what is simply done that way because “so-and-so has always handled it.”
If it is just a matter of habit and not something critical, it goes to the bottom of the priorities within that map. And speaking of priorities…

3. Document the most critical and frequent things first

Voltaire said that “the perfect is the enemy of the good,” and although he did not know it, he was also talking about technical documentation.
We do not need a perfect, fully complete knowledge base for it to be useful. Let us start by feeding into it what would cause the most damage if lost, and what recurs most often in day-to-day operations.
That includes a difficult part: getting the resident geniuses, with heads full of wisdom, to pour it out for everyone in a way that others can learn from.
Here another factor that feeds dependence comes into play: many of those technicians will be reluctant. All the more so in times of uncertainty, when part of that dependence comes from technicians who want to make themselves indispensable, whether for job security, status, or other reasons.
With great tact, we must extract that knowledge from those neurons and pour it into a living knowledge base.

4. Standardize before automating

Now that we are sold on machines doing everything, we rush to apply automation patches before understanding what we want to achieve.
However, automating a chaotic process only produces chaos faster.
The solution is to first make the optimal or expected behavior of a process crystal clear, and only then let machines execute it.

5. Validate that the rest of the technicians can work and resolve issues without constant support

It is no use having the Library of Alexandria if no one can then follow the documentation from step 3 without needing the usual person.
The real proof that we do not merely have a pretty wiki is that a technician not called Scotty resolves the problem autonomously. If they cannot, the documentation is not ready, however complete it may appear.

6. Review periodically

Technical dependencies in organizations regenerate like weeds.
New clients, new technologies, new habits… We must audit them regularly to prevent bad habits from crystallizing again.
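Part of that audit can be reduced to a simple recurring check. The sketch below assumes a hypothetical index that maps each runbook to the date it was last reviewed, and lists those that have gone unreviewed for longer than a chosen window. The runbook names and the 180-day window are illustrative assumptions.

```python
from datetime import date, timedelta

# Hypothetical runbook index: title -> date of the last review.
runbooks = {
    "acme-backup-restore": date(2023, 1, 15),
    "globex-vpn-failover": date(2024, 4, 20),
    "initech-patching": date(2023, 11, 30),
}


def stale_runbooks(index, today, max_age_days=180):
    """Return the titles of runbooks last reviewed before the cutoff date."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(title for title, reviewed in index.items() if reviewed < cutoff)


for title in stale_runbooks(runbooks, today=date(2024, 6, 1)):
    print(f"needs review: {title}")
```

Passing `today` explicitly keeps the check reproducible; a scheduled job would pass `date.today()` instead.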

How Pandora FMS helps reduce operational dependence in an MSP

Pandora FMS was born in the trenches, out of hands-on practice and experience. That shows in how it tackles precisely this problem of dependence, correcting what we have verified, over more than twenty years, to be its root causes.
To begin with, the centralized visibility of Pandora FMS allows any technician on the team to see the complete status of any client from a single location: our Metaconsole.
There is no need to bother the genius on vacation to find out what is happening, nor to dig through last year’s chats. The context is right there, in an infrastructure monitoring solution accessible to everyone.
The inventory and shared context, in turn, eliminate the need for anyone to memorize the client’s infrastructure. What is there, how it is configured, what its event history looks like… With Pandora FMS, everything is recorded and available in the system.
Reusable templates allow operational knowledge to be codified once and applied to all similar clients. In that way, what used to live inside a technician’s head now lives in a policy that anyone can deploy.
The operational automation built into Pandora FMS handles repetitive tasks: self-healing, automatic escalations, predefined responses to known conditions… If the system knows what to do, no one needs to be woken at three in the morning.
In addition, event and alert traceability provides an auditable record of what happened, when, and what was done about it. This means that a new hire, or a technician covering for someone else, can quickly understand the history without depending on neurons that have gone on vacation.
And in environments where security forms part of the service, Pandora SIEM complements this vision with security event correlation and unified visibility, in line with the best practices of ENISA and other reference frameworks.
The central idea of Pandora FMS is simple:
Let operations depend on the system and its processes, not on Scotty working yet another miracle — because I am afraid life is not a TV series, much less a utopia.
A mature MSP must cultivate talent without generating dependence, thereby ensuring a continuity of service that does not rely exclusively on a handful of individuals, however brilliant they may be.
I insist that individual brilliance is both admirable and an unacceptable single point of failure when there are SLAs to meet and clients who trust us — which is why premises such as the pursuit of the famous 10x engineer are a double-edged sword.
Furthermore, an indispensable condition for genuine scalability is liberating knowledge from the minds of chosen individuals, and placing it safely within the processes, tools, and common knowledge of the MSP.
That way, we will be able to take on more clients, prevent reactive support from consuming the margin, and have no need for heroes. After all, it has been proven many times that a good team always outperforms a handful of superstars each fighting their own private battle.
That is what we want, because with technical dependence we will have an adventure every day — but those only look good in fiction.
