The Problem with Problem Management: Key to Implementing This Discipline in a “Psychologically Safe” Culture
Problem management isn’t a new process or concept; in fact, it’s been around (at least in the ITIL world) for 20+ years. In my professional career, this is the third organization in which I’m actively implementing this discipline and working to mature it. In its simplest form, it aims to get at the root cause of incidents, not just to restore service by any means. But to do that, you have to get to the heart of the matter, which isn’t easy. It requires many things, including transparency, cooperation, and a level of zeal from multiple teams to find and eradicate (strong word, but effective) the problem by all means possible. That’s if it’s worth eliminating, because sometimes it isn’t: when the cost is prohibitive, when you’re about to migrate off the platform or application, or when the cost of the outage is far less than the total cost of eliminating the root cause.
Here is what I’ve learned from many efforts to implement problem management in different industries.
1) The number one non-negotiable element you need for this effort to be successful is a culture of psychological safety. That means an environment where technologists feel safe discussing, debating, or arguing about the root cause and the inherent issues in the environment, and disclosing sensitive information, including their own areas of deficiency and/or negligence. If teams don’t feel that they can bring their whole selves to the table, then the entire exercise may be little more than a formality, a facade. People need to feel safe when they talk about real issues. Any fear of retaliation, of a career cul-de-sac, or even of the simplest judgement will sabotage the effort. I lead with this one as the most critical recommendation because without it, you may as well call it a day. Without it, you’ll be rolling a rock uphill all the way and every day. Without it, you’re spinning your wheels and getting nowhere…you get the picture.
2) Have a problem “manager” for governance. And don’t rotate this responsibility. It needs to be one consistent individual responsible for leading this effort, or a member of the process management (governance) team, who is highly skilled in leading a root cause analysis effort. Even early on, one person who is focused and willing to learn this is better than many who are assigned to do it on and off.
3) Discipline and regularity in the approach to uncovering the truth. In some places I’ve seen teams use the Five Whys methodology; in other places, it’s an Ishikawa diagram or a Pareto analysis. Whatever method you decide to use, follow it, make sure you have action items, and make sure they are logged in a ticketing system (if you don’t have one, then SharePoint, OneNote, Teams, etc.). The action items need owners, due dates, and follow-up tasks, up to and including escalating to the assignee’s management for prioritization and traction.
4) Actionable “Change” to eliminate the problem as a priority. Rarely will a problem, once discovered, go away on its own. There is usually at least one change necessary, if not a series of them. These can be infrastructure-related, requiring change management approval to replace or supplement hardware or to change software, or they can be procedural changes, training, communication, etc. In the odd situation where you discover a root cause but opt not to remove it, it is important to document what it was and the reason for accepting the status quo.
5) Reporting progress and celebrating success! Yes, it had to be said this way. First you report on how much the team has been doing and the progress in eliminating root causes and closing out problems, which ultimately leads to a successful effort. This becomes everyone’s success. Once team members feel that they are actually chiseling away at problems and eliminating the root causes, they’ll begin to see how many fewer major incidents/outages or recurring issues they face and have to tackle reactively. They start to get excited about getting time and bandwidth back, and they become extremely protective of their environment and of the problem management effort.
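For teams that track action items in a script or spreadsheet export rather than a full ticketing system, the follow-up rule in point 3 (owners, due dates, escalation to the assignee's management) could be sketched roughly like this. This is only an illustration in Python: the `ActionItem` record, its field names, and the seven-day grace period are hypothetical choices of mine, not part of ITIL or of any particular ticketing tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One follow-up task from a root cause analysis (hypothetical schema)."""
    problem_id: str
    description: str
    owner: str
    owner_manager: str
    due: date
    closed_on: Optional[date] = None  # None means the item is still open

    def is_overdue(self, today: date) -> bool:
        # Only open items can be overdue.
        return self.closed_on is None and today > self.due


def escalations(items: List[ActionItem], today: date, grace_days: int = 7) -> List[str]:
    """Return one notice per open item that is more than grace_days past due.

    grace_days is an assumed threshold; pick whatever your team agrees on.
    """
    notices = []
    for item in items:
        days_late = (today - item.due).days
        if item.closed_on is None and days_late > grace_days:
            notices.append(
                f"{item.problem_id}: '{item.description}' is {days_late} days late; "
                f"escalate from {item.owner} to {item.owner_manager}"
            )
    return notices
```

Run on a schedule (weekly, for example), something like this gives the problem manager a short list to raise with management, which keeps the escalation step routine rather than personal.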
I know that this sounds rather easy, the way I’ve laid it out. It is, and it isn’t. To do all this, team members have to start from a place of TRUST. They have to trust one another and their management team before they can trust the process. Asking them to trust the process first is generally not going to work. It’s not intuitive enough, unless you’re already in an environment where people feel safe rather than fearful at every turn. I’ve witnessed the best and the worst of this effort in my time. I’ve seen individuals shuddering in fear in major incident/post mortem reviews, and for good reason, when those meetings ended up costing them disciplinary action. It really shouldn’t come to that, unless pure negligence or devious activity is uncovered (and generally it isn’t).
So before tackling problem management purely as a process with an end goal of stabilizing the environment, or better yet, improving the “customer experience”, do the right thing…begin by asking yourself the question: does the IT team feel safe enough together to do this? Do you have a culture of psychological safety? Will they come together and build something amazing? If the answer is yes, then go for it! But if the answer is maybe or a downright no, then start there. Lay down the ground rules for trust and transparency first. That’s the bedrock for this process. Without it, the process is doomed to spin around itself in perpetuity…and to point back at the real problem, which is management, not problem management.
Some resources to help on the topic of psychological safety:
https://www.ccl.org/articles/leading-effectively-articles/what-is-psychological-safety-at-work/
Some resources to help with problem management:
https://www.proprofsdesk.com/blog/itil-problem-management-lifecycle/
https://agilephoria.com/news/problem-solving-workshop-step-by-step/