Troubleshooting Systems Management or “Is my kid peeing in the pool?”


The problem may be you

Photo by Matt Graves

We were deploying Tivoli Management Agent (TMA) out to about 15,000 computers in 2001. There were about 170 administrators who were responsible for all these systems. There was A LOT of resistance from those administrators about adopting a centralized management system like Tivoli Framework. I will never forget one of the messages we received from an administrator who clearly had no concept of proper troubleshooting. The message, in part, read:

“We cannot install the TMA on our 230 computers because it breaks Oracle”.

I replied with a request for her to be MUCH more specific. Her response was priceless:

“We feel that the TMA may be causing the Oracle client to lose connection to the Oracle database.” (Italics added to demonstrate stupidity…I mean for emphasis.)

“May be causing …?!” To paraphrase B.J. Hunnicutt: “So far I have a definite possibility of three absolute maybes”.

A co-worker of mine responded with the famous flea/cricket story used to demonstrate improper troubleshooting and to call her out on her textbook use of the “post hoc ergo propter hoc” logical fallacy.

A scientist taught a flea to jump on command. Out of curiosity he thought he would do some experiments with his trained flea. “Jump!” he yelled, and the flea jumped six inches into the air. The scientist then pulled off two of the flea’s legs and yelled “Jump!”. This jump was only four inches high. He ripped off two more legs and the jump was reduced to two inches. After the last two legs came off the flea didn’t jump anymore. The scientist then wrote a paper explaining how if you pull all the legs off a flea, the flea goes deaf.

Needless to say, this email didn’t go over well but that’s a story for another time.

My co-worker and I wrote another email detailing some things we wanted her to document for us.

  1. Basic info: OS, Oracle client version etc.
  2. How consistent are the errors? Are these errors present on multiple machines?
  3. Please duplicate the error and send us the relevant Event Log entries.
  4. Can you duplicate the error with the TMA service turned off?
  5. Are there any articles on IBM or Oracle support sites documenting these problems?

There were a few more suggestions but you get the point. We wrapped it up with an invitation to come to their department and perform these tests ourselves.
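Item 4 on that list is the one that separates "may be causing" from evidence: run the same check with the agent running and with it stopped, and compare. Here is a rough Python sketch of that idea, not anything we actually sent her. The names `probe` and `set_agent` are hypothetical placeholders: `probe` stands in for whatever reproduces the symptom (say, opening an Oracle connection) and returns True on success, and `set_agent` stands in for whatever starts or stops the service on the machine under test.

```python
from typing import Callable, Dict


def failure_rate(probe: Callable[[], bool], trials: int) -> float:
    """Run probe() `trials` times and return the fraction of runs that failed."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    failures = sum(1 for _ in range(trials) if not probe())
    return failures / trials


def compare_agent_on_off(
    probe: Callable[[], bool],
    set_agent: Callable[[bool], None],
    trials: int = 50,
) -> Dict[str, float]:
    """Measure the probe's failure rate with the agent running and stopped."""
    results: Dict[str, float] = {}
    for label, running in (("agent_on", True), ("agent_off", False)):
        set_agent(running)  # hypothetical hook: start or stop the service
        results[label] = failure_rate(probe, trials)
    set_agent(True)  # restore the agent (assumes it was running beforehand)
    return results
```

If the two failure rates come out about the same, the agent is probably not your culprit. If the failures show up only with the agent running, and that holds on more than one machine, you have something worth sending along with the Event Log entries.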

We never heard back. The TMA was installed as planned.

I know, trust me, how painful it can be to have customers or end-users make blanket accusations or knee-jerk explanations of their problems. The most common is probably “The network is down!” because they can’t get to a particular website or a print queue is backed up. The problem is that many of our management tools or configuration settings MAY be causing the problem that a user is experiencing. We have to remember that we have one or more swimmers in the public swimming pool that is a distributed computer environment. Maybe one of our swimmers is peeing in the pool. We can’t have a knee-jerk reaction to what is perhaps the user’s knee-jerk reaction because we may be at fault.

Keep the emotions in check, work the problem and save your frustrations and ranting for poker night or the occasional blog entry.

Follow me on Twitter @ShaneCorellian

Use PDQ Deploy to deploy software to your computers. It’s fully functional, fully free and 99.6% urine free.