Troubleshooting a problem is a great time to remember that the network is at the bottom of the IT stack. This fact is not meant to suggest that the network should be viewed as less important than other components in the stack. Quite the contrary, being at the bottom means it is the foundation for everything. The network is also in the unique position of being able to view a problem from end-to-end.
How May I Be of Service?
When problems arise, it is second nature for people to seek quick exoneration for their systems. This behavior is as true of application developers, database administrators and systems administrators as it is for storage and network administrators. Accusing someone’s system of being the root cause of a problem can have the same effect as calling their baby ugly. Curt communications may be the best for which you can hope.
The do-more-with-less culture, pervasive in today’s corporate world, means the same human resources used to keep the lights on are the same resources developing and implementing new features. Lowering the mean time to innocence (MTTI) leaves more time to work on enhancements and change requests that move the business forward. The difficulty with trying to prove innocence is that problems get passed between teams like a hot potato and root cause is often not discovered. This behavior leads to repeating issues and less time to help the business.
All this to say that when your skills are called upon to troubleshoot a problem, do it to the best of your ability. When the network is blamed for a problem rather than being politely asked to check if it can help fix a problem, focus on supporting and being of service. Don’t focus on the blame. Rather than assuming the network is innocent, exhaust your troubleshooting methodology until the root cause is found somewhere in the system. It will take time, but this level of effort and caring will build trust and credibility.
Planning the Attack
You should always ask these two questions when confronted with a problem for the first time:
- Did the system ever work?
- Did anything change?
The answers to these questions will help target the attack.
It is crucial to have a repeatable, adaptable troubleshooting methodology when attacking complex problems. The following are just a few examples of dozens documented across many industries.
- The OODA Loop: Observe, Orient, Decide, Act was developed by Colonel John Boyd of the United States Air Force.
- PDCA: Plan, Do, Check, Act is a management methodology developed by Toyota to support its lean manufacturing process.
- GTD: Capture, Clarify, Organize, Reflect, and Engage are the fundamentals of David Allen’s Getting Things Done productivity methodology.
While these examples are not specific to troubleshooting, they are simple and have common elements that allow for easy iteration. These elements can be summarized into the following rudimentary steps:
- Observe and document symptoms
- Develop a theory on where the likely root cause is or where better observations can be made
- Decide on the best course of action
- Rinse and repeat until the problem is fixed
Divide and Conquer
In this network operations post, I discussed the importance of network visibility tools. Troubleshooting problems is where these tools can shine. If network visibility was built-in from day one, the tools might have captured the symptoms already, allowing for immediate analysis. If not, collecting and analyzing evidence may have to wait until the problem occurs again. Waiting for tools to be installed and configured can extend problem resolution for intermittent issues.
Looking at the figures below, it should be easy to see that having more than one observation point in the network makes it easier to pinpoint the problem. Figure 1 has a single observation point. Depending on the symptoms, it may be hard to tell if the problem is to the right or to the left of this point. Figure 2 displays two observation points. Depending on the symptoms, it should be easier to tell if the problem is to the right of both points, to the left of both points, or somewhere in between. Adding more observation points shrinks the search area more quickly.
We’re All in this Together
The goal of IT is to help make the business the best it can be. When problems occur in systems that support the business, all divisions of Information Technology need to step up and be part of the solution. For many reasons, this doesn’t always happen. In any situation, we can only control ourselves. Individual positivity, responsiveness, openness, and a sense of service go a long way to building credibility, trust and a high performing team.
Read My Other Blogs in the Back-to-Basics Series